One way to improve the capability of the neural network is to incorporate human
knowledge.  Knowledge-based  neural  networks  can  utilize  both  the  knowledge  from 
human  experts  and  the  knowledge  inherent  in  the  data.  Neural  networks  mapped 
from  a set  of  production  rules  are  referred  to  as  rule-based  neural  networks.  Fuzziness 
and  context  are  two  distinct  aspects  of  the  knowledge  of  human  experts.  However, 
previous  research  on  rule-based  neural  networks  did  not  consider  these  two  important 
aspects.  Thus,  a fuzzification  layer  is  proposed  to  be  added  to  the  rule-based  neural 
network  for  encoding  the  fuzziness  of  continuous  input  variables.  The  fuzzification 
layer  can  be  superimposed  onto  any  back-propagation  networks.  Moreover,  a neural 
model with an adaptive memory structure follows the rule-based neural network in
order to perform context processing. The adaptive memory is an extension of the
gamma  model  (a  neural  model  for  temporal  processing)  for  learning  both  the  right 
and  left  contexts.  A hybrid  information  processing  system  is  constructed  based  on 
the  above-mentioned  neural  models,  and  is  evaluated  on  the  electroencephalogram 
(EEG)  sleep  staging  problem.  Different  experiments  are  designed  to  test  the  system. 
Compared  with  the  previous  results  on  the  same  sleep  records,  the  present  results 
demonstrate the following advantages. First, knowledge-based neural networks
outperform both symbolic knowledge-based systems and traditional neural networks.
Second, fuzzy encoding of continuous data gives better performance than crisp
data representation. Third, context processing effectively enhances the results of
information  processing.  The  developed  system  is  promising  for  sleep  staging,  but  the 
design  principles  are  generally  applicable. 


CHAPTER  1 
INTRODUCTION 


1.1  Background 

1.1.1  Artificial  Neural  Networks 

Inspired by biological neural systems, scientists have been trying to construct
artificial neural networks that mimic intelligent human behavior. Artificial
neural networks (or simply neural networks) are also known as connectionist
models and parallel distributed processing models. A neural network is a
group of processing elements (or artificial neurons) with weighted interconnections.
Each processing element has an activation function that combines the incoming
activations and produces an outgoing activation. Neural networks improve their
performance by adjusting the connection weights. This process is called training or
learning.

In recent years, neural networks have been applied to a variety of problems, e.g.,
nonlinear mapping, pattern classification, associative memory, self-organization,
feature extraction, and optimization, and have proved to be very useful. For an
introduction to neural networks, readers are referred to Lippmann [1987],
Hecht-Nielsen [1990], and Hertz et al. [1991]. There are also books relating neural
networks to other research areas, e.g., fuzzy systems [Kosko, 1992a], knowledge-based
systems [Fu, 1994], control systems [Miller et al., 1990], signal processing
[Kosko, 1992b], and pattern recognition [Pao, 1989].

Neural networks can be classified by network topology, processing element
characteristics (the activation function), and training algorithm. By network
topology, neural networks can be generally categorized into feedforward networks
and recurrent networks. Strictly feedforward networks do not have feedback (or
recursive) connections, while recurrent networks do. Figure 1.1 illustrates these two
categories of neural networks.

Figure  1.1(a)  is  a multilayer  perceptron  (MLP)  with  two  hidden  layers,  which  is 
the  typical  form  of  feedforward  nets.  The  MLP  has  a clear  layer-by-layer  structure 
and  every  processing  element  connects  to  all  the  processing  elements  of  its  adjacent 
layers.  Figure  1.1(b)  is  a fully  recurrent  network.  Some  processing  elements  of  the  net 
accept  external  inputs  and  some  processing  elements  produce  the  network  outputs. 
Each processing element connects to all the other ones. The feedback loops give
recurrent networks more processing power, but they also make training more
difficult.

The most general processing element (node or unit, hereafter) has an activation
function which sums up the weighted inputs and passes the sum through a nonlinear
function. The sigmoid function is the commonly used nonlinear function for neural
networks:

$$f(net) = \frac{1}{1 + e^{-net}} \tag{1.1}$$

where net is the sum of the weighted inputs.
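
As a minimal sketch of Eq. 1.1, the following Python fragment computes the output of a single unit. The function names and the optional bias argument are illustrative assumptions, not part of the original text.

```python
import math


def sigmoid(net: float) -> float:
    """Sigmoid nonlinearity of Eq. 1.1: f(net) = 1 / (1 + e^(-net))."""
    # Note: math.exp overflows for very large negative net (below about -709).
    return 1.0 / (1.0 + math.exp(-net))


def unit_output(inputs, weights, bias=0.0):
    """Output of one processing element.

    net is the sum of the weighted inputs (plus an assumed optional bias),
    which is then passed through the sigmoid.
    """
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)
```

For example, `unit_output([1.0, 0.0], [2.0, -1.0])` evaluates the sigmoid at net = 2.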

Training is a major issue in neural networks. Back-propagation [Rumelhart et al.,
1986] is currently the most important training algorithm for feedforward networks.
It will be introduced in the next subsection. Two training algorithms for recurrent
networks are (1) back-propagation through time [Werbos, 1990] and (2) real-time
recurrent learning [Williams and Zipser, 1989]. The training procedure for
recurrent networks is much more complicated. Thus, feedforward networks and partially
recurrent networks, which have a feedforward structure and a few selected feedback
connections, are favored over fully recurrent networks.

1.1.2  Back-Propagation  Neural  Networks

Neural network research was nearly brought to a halt by Minsky and Papert's famous
book, Perceptrons [Minsky and Papert, 1969]. The reason was that the perceptron (a
single-layer network) can only divide the pattern space linearly. It cannot solve some
simple problems, e.g., the exclusive-or (XOR) problem. Multilayer networks are capable
of learning this nonlinear mapping. However, the lack of a proper learning scheme for
multilayer networks led neural network research into a dark age of about two decades,
until the error back-propagation (or simply back-propagation) training algorithm
[Rumelhart et al., 1986] was shown to be powerful. The same algorithm was independently
discovered by Bryson and Ho [1969] and Werbos [1974]. Back-propagation is simply
based on gradient descent. It does not guarantee convergence to the global minimum for
multilayer perceptrons, but it has been shown to be very successful in many
applications. Simulated annealing is a stochastic optimization method which can find
the global minimum of the error surface in weight space [Aarts, 1989]. But it is very
slow and often viewed as impractical. Feedforward networks with back-propagation
learning are referred to as back-propagation neural networks.

It has been proved that a three-layer network can approximate any arbitrary
mapping function [Hecht-Nielsen, 1987]. Although it is unknown whether any method
exists for constructing such a net for a given function, back-propagation has provided
a means for weight adjustment to achieve sufficiently good performance.

In a single-layer network, the weights of the connections are updated according
to the error of the output unit, defined as the difference between the actual output
and the desired output. In a multilayer network, no such error measures are
available for the hidden units. Back-propagation propagates the errors from the
output layer backward toward the input layer and thereby provides estimates of the
errors in the hidden units.

We can define an overall measure of the error (E) for evaluating the network
performance as

$$E = \frac{1}{2} \sum_{p} \sum_{k} \left( d_{p,k} - o_{p,k} \right)^2 \tag{1.2}$$

where $d_{p,k}$ is the desired or target output, $o_{p,k}$ is the actual output, p denotes
the index of the training patterns, and k denotes the numbering of the output units. To
implement gradient descent in E, we need to find the derivative of E with respect
to each weight $w_{ji}$. The derivative can be calculated by applying the chain rule.
The weight modification is proportional to the negative of the derivative in order to
minimize E in weight space. The weight adaptation equation is

$$\Delta_p w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} + \alpha \, \Delta_{p-1} w_{ji} \tag{1.3}$$

where $\eta$ is the learning rate or step size and $\alpha$ is a constant which determines
the effect of past weight modifications on the current weight modification. The value of
$\alpha$ lies in the interval (0, 1). To ensure smooth convergence at a reasonable speed,
the learning rate needs to be carefully selected. Unfortunately, the selection is always
empirical, so several rounds of trial and error are sometimes necessary for a new
problem. The second term on the right-hand side of Eq. 1.3 is called the momentum term,
which is added to speed up the learning without causing oscillation. Another important
point about using back-propagation is that the activation function needs to be
differentiable. A detailed derivation of back-propagation can be found in Rumelhart
et al. [1986], and a good summary is given in Lippmann [1987].
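
To make the weight adaptation rule concrete, the following Python sketch applies the error measure of Eq. 1.2 and the momentum update of Eq. 1.3 to a single sigmoid output unit. The toy data (logical OR), the values of the learning rate and momentum constant, and all function names are assumptions chosen only for illustration; a full back-propagation network would additionally propagate the error terms to hidden units.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


# Assumed toy training set: logical OR. The third input is a constant 1.0,
# so the third weight plays the role of the bias.
PATTERNS = [([0.0, 0.0, 1.0], 0.0), ([0.0, 1.0, 1.0], 1.0),
            ([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 1.0], 1.0)]


def output(weights, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def total_error(weights):
    """Eq. 1.2 for one output unit: E = 1/2 * sum over patterns of (d - o)^2."""
    return 0.5 * sum((d - output(weights, xs)) ** 2 for xs, d in PATTERNS)


def train(epochs=1000, eta=0.5, alpha=0.5):
    """Pattern-by-pattern gradient descent with a momentum term (Eq. 1.3)."""
    weights = [0.0, 0.0, 0.0]
    previous_delta = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for xs, d in PATTERNS:
            o = output(weights, xs)
            # Chain rule for a sigmoid unit: dE/dw_i = -(d - o) * o * (1 - o) * x_i
            gradient = [-(d - o) * o * (1.0 - o) * x for x in xs]
            # Eq. 1.3: new change = -eta * gradient + alpha * previous change
            delta = [-eta * g + alpha * pd
                     for g, pd in zip(gradient, previous_delta)]
            weights = [w + dw for w, dw in zip(weights, delta)]
            previous_delta = delta
    return weights
```

Calling `train()` and comparing `total_error` before and after training shows the error decreasing from its initial value of 0.5; the factor o(1 - o) in the gradient is where the differentiability of the activation function is needed.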

1.1.3  Neural  Networks  in  Signal  and  Information  Processing 

Signal processing usually involves improvement of signal quality, e.g., noise
reduction, and extraction of information (signal detection). Of course, improving
signal quality always helps in extracting information. Sometimes the extracted
information needs further analysis to make its meaning clearer. This is called
information processing.

Traditional signal processing is largely based on the theory of linear time-invariant
(LTI) systems. Nonlinear systems are linearized so that the LTI system theory can be
applied. As mentioned in the previous subsection, neural networks are universal
approximators: they can approximate any mapping function, linear or nonlinear, without
involving complex mathematics. Thus, neural networks have advantages over traditional
signal processing methods on nonlinear problems.

The human brain is an excellent information processor. Neural networks possess
some simple characteristics of the brain. They can solve problems through
self-learning and self-organization. Neural networks offer several computational
advantages. First, neural networks can extract, abstract, and generalize
statistical properties from data. They can acquire knowledge in noisy and uncertain
environments. Second, neural networks provide flexible knowledge representation.
They can create their own representation by self-organization. Third, neural
networks achieve high speed via massive parallelism, which makes information
processing by neural networks more efficient. Fourth, neural networks have a great
fault-tolerance capability. The system performance degrades gracefully in the
presence of errors.

1.2  Motivations

1.2.1  Technology  Integration 

The most successful application of conventional (or symbolic) artificial
intelligence (AI) is the knowledge-based (or expert) system, which was very popular in
the 1980s. In recent years, many techniques have gone beyond symbolic AI and given rise
to a new research direction named computational intelligence (CI) [Marks, 1993].
Computational intelligence depends fully on low-level numerical data and does not rely
on “knowledge tidbits.” Knowledge tidbits are formulated pieces of human knowledge,
which are used in symbolic AI. Knowledge tidbits are only “pieces of relevant
information, but not the whole story” [Bezdek, 1992].

Neural networks, fuzzy systems, and genetic algorithms are the main areas in
computational intelligence, and they attract numerous researchers. Meanwhile, the
integration of these techniques along with knowledge-based systems is of great interest
since each technique has its own brittleness. For example, neural network training
requires sample data which cover the sample space well; knowledge-based systems need
correct and consistent human knowledge, which is sometimes hard to find; and the
success of fuzzy systems depends on heuristically determined fuzzy membership
functions. Thus, it is desirable to integrate these techniques into one framework
in which their respective merits are preserved. In this work, the focus is on the
neural network into which other techniques are incorporated.

1.2.2  Knowledge-Based  Neural  Networks

Human knowledge can be formulated as production rules. A knowledge-based system
performs inference over a rule base made up of a set of production rules. In
contrast, a neural network acquires its knowledge through training on data. A neural
network without proper training has no power at all. It is desirable to integrate
these two techniques so that both sources of knowledge can be utilized together. One
major area of research on combining these two techniques is knowledge-based neural
networks, which deal with the transformation of domain knowledge into a neural network
structure [Fu, 1993].

Knowledge-based neural networks include rule-based neural networks, decision
tree-based neural networks, and semantic-constraint-based neural networks, which
differ in their formats of knowledge [Fu, 1994]. A rule-based neural network (RBNN)
directly maps a set of production rules into a neural net structure. Hence, the RBNN
acquires its expert knowledge in the form of rules right at its initial construction,
and it has the learning power to improve that knowledge if appropriate data are
available. Towell et al. [1990] show that the RBNN can outperform standard
back-propagation networks. Fu [1993] describes how domain knowledge can be refined by
the RBNN. Yet two major aspects of human knowledge are missing from the RBNN
framework: the fuzzy nature of human knowledge expression and the context relating the
environment to the current event.

1.2.3  Fuzziness  and  Context

As mentioned in the previous subsection, fuzziness and context are two important
aspects of human knowledge that should be incorporated into the framework of the
rule-based neural network. By doing so, different aspects of human knowledge can be
thoroughly considered along with the learning power of neural networks. Here, I briefly
introduce the concepts of fuzziness and context. They will be discussed in detail in
Chapters 3 and 4, respectively.

Fuzzy logic is a branch of mathematics which allows the modeling of the real
world in the way people see that world. When we say that a dog is “very” or
“somewhat” pretty, we give different degrees of prettiness to the dog. In people’s
eyes, dogs are not simply pretty or not pretty. Traditional set theory cannot represent
this phenomenon adequately. A fuzzy set, on the other hand, can easily represent
continuous phenomena, which are not easily broken down into discrete segments, by
using fuzzy membership functions.
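
The contrast between a crisp set and a fuzzy set can be sketched in a few lines of Python. The triangular shape, the breakpoints, and the function names below are assumptions made purely for illustration; they are not the fuzzification method proposed in Chapter 3.

```python
def crisp_membership(x, low, high):
    """Traditional (crisp) set: x either belongs (1.0) or does not (0.0)."""
    return 1.0 if low <= x <= high else 0.0


def triangular_membership(x, left, peak, right):
    """Fuzzy set with an assumed triangular membership function.

    The degree of membership rises linearly from 0 at `left` to 1 at `peak`
    and falls back to 0 at `right` (requires left < peak < right).
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)
```

With the hypothetical breakpoints (0, 5, 10), a value of 2.5 belongs to the fuzzy set to degree 0.5, whereas a crisp set over the same interval can only report full membership or none.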

When experts apply their knowledge to a real-world problem, the rules are
often fuzzy. People tend to use linguistic terms instead of precise numerical values.
Fuzzy rules are written in terms of imprecise ideas of what constitutes the states of
a variable. Fuzzy rules are more natural and expressive. One fuzzy rule can replace
many conventional rules, and it is closer to the way that human experts represent
their knowledge [Zimmermann, 1991; Cox, 1992]. Fuzzy expert systems have been
widely used in many applications. Fuzzy neural nets, or even fuzzy expert nets, are
therefore desirable for the purpose of integrating these techniques.


Another major issue in this research is context information, which can exist
in time, as in a speech signal, or in space, as in an image. Many pattern recognition
problems are context-sensitive. For example, to determine whether a word in a sentence
is a noun, a verb, or something else, we need to check the adjacent words.
When using neural networks on a context-sensitive problem, special architectures are
necessary to encode and learn the context information.

To represent the inputs over a period of time, a short-term memory structure is
needed. Neural net memory has been investigated by different researchers [Jordan,
1986; Mozer, 1989, 1993; Elman, 1990; Lang et al., 1990; de Vries and Principe, 1992;
Principe et al., 1994], mainly for the problem of temporal processing. For
context-sensitive pattern recognition problems, the neural net also needs a similar
memory structure to hold the information over adjacent events.

1.2.4  The  Neural  Network  System  for  EEG  Sleep  Staging 

Neural networks incorporating artificial or computational intelligence techniques
are very suitable for signal and information processing. Electroencephalogram (EEG)
sleep staging is a problem which requires both low-level signal detection and
high-level information analysis. It is an important and challenging engineering problem
that involves the codification of human knowledge. Hence, it is chosen as the
application area in this research.

Sleep has been categorized into six stages: awake, stages 1, 2, 3, and 4, and REM
(rapid eye movement). Sleep staging is traditionally done by human experts reading
electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG)
recordings. Sleep EEG/EOG/EMG recordings are divided into 30- or 60-second
epochs. Each epoch is classified into one of the six stages.

When human experts score the EEG sleep data, they recognize the related events
first. Then, they make decisions based on the occurrence and duration of the events.
Sometimes, they have to check the temporal context to refine the decisions. The
first part of this task (signal detection) has been done by different techniques, and
the man-machine agreement can be over 90%. On the other hand, for the second and
third parts (information analysis), current systems reach only about 85% agreement.
This research is intended to improve these results.

The neural network-based hybrid system developed in this research for automated
sleep staging consists of three subsystems. Each subsystem corresponds to a part of
the human decision making. The first subsystem comprises waveform detectors
which recognize related EEG/EOG/EMG events and summarize them in minute
tokens. The second subsystem is a static classifier which processes token
information within each minute and produces primary results. A rule-based neural network
(RBNN) incorporating fuzzy representation is used. It takes both the knowledge
of human experts and the knowledge inherent in the data into account. The third
subsystem is a context analyst which processes the context information to produce
the final results. A special temporal neural network is used for this purpose. This
research concentrates on the second and third parts of the system.

1.3  Overview

This research is intended to incorporate fuzzy representation and context
processing into a neural network model. A neural network-based hybrid system has been
built for information processing on automated sleep staging. Because of the
multifaceted nature of the research, each major aspect is treated in a separate chapter
to cover basic concepts, past related work, and new results from this research.

In Chapter 2, we review knowledge-based neural networks with a focus on
rule-based neural networks. The knowledge-based conceptual neural network (KBCNN)
[Fu, 1993] is introduced in detail. The certainty-factor-based activation function is
described. Also, the KBCNN is compared with some other important rule-based
neural networks and with the multilayer perceptron.

In  Chapter  3,  a fuzzification  method  is  proposed  for  the  neural  network  to  handle 
fuzziness.  A fuzzy  neural  network  can  be  constructed  by  building  in  a fuzzification 
layer.  We  first  review  concepts  of  fuzzy  sets  and  fuzzy  membership  functions  and 
related  work,  and  then  present  the  details  of  the  proposed  fuzzification  method. 

In Chapter 4, a new neural model with an adaptive memory structure developed
for context learning and processing is presented. We first review neural network
models for temporal sequence/information processing with emphasis on different forms of
short-term memory. Then the proposed context-processing neural model is described
in detail.

In Chapter 5, the EEG sleep staging problem is addressed. Definitions of sleep
stages and automated methods are reviewed. Then neural network approaches are
discussed. A hybrid information processing system is presented, which augments
rule-based neural networks with the fuzzification layer and the context-processing
neural model.

In  Chapter  6,  several  experiments  in  the  domain  of  EEG  sleep  staging  are  designed 
to  evaluate  the  developed  neural  system.  The  sleep  records  used  in  the  experiments 
are  analyzed  first.  Then  the  experiments  are  described,  and  the  results  are  presented 
and  discussed. 

In Chapter 7, the work is discussed and conclusions are drawn.

In the Appendix, the sleep stage profiles of the sleep records used in the
experiments are provided with figures.

Figure 1.1. Two categories of neural networks: (a) feedforward neural network,
with input, hidden, and output layers; (b) recurrent neural network.


CHAPTER  2 

KNOWLEDGE-BASED  NEURAL  NETWORKS 
2.1  Introduction 

The knowledge-based system and the neural network are two very different
approaches to human-made intelligence. The knowledge-based system exploits intensive
human knowledge about a problem domain. Human knowledge is formulated and
used as the basis for inference. If the formulated domain knowledge is weak, then
the knowledge-based system cannot work. On the other hand, the neural network
needs a set of data which covers the sample space well. Network performance is
optimized by adjusting connection weights through training on the data. If the data
are inadequate, the neural network cannot learn well.

It is essential to incorporate existing knowledge into the neural network framework
so that both knowledge formulated by humans and knowledge inherent in data can
be utilized. In the meantime, determination of the optimal configuration of a neural
net is still an open issue in neural computation. Human knowledge can provide help
on this matter. Thus, integration of knowledge-based systems and neural networks
is one direction to be pursued.

One major research area combining knowledge bases and neural computing
is knowledge-based neural networks, which deal with direct transformation from a
knowledge base to a neural net. A neural network transformed from a set of rules is
referred to as the rule-based neural network (RBNN) or the rule-based connectionist
network [Fu, 1993, 1994]. Different researchers, e.g., Towell et al. [1990], Lacher
et al. [1992], Goodman et al. [1992], and Fu [1991, 1993], have tried to “incorporate”
human knowledge into neural networks by constraining the network topology and
assigning appropriate connection weights.

In the next section, the knowledge-based conceptual neural network is described
in detail since it is the rule-based neural network paradigm chosen for this research.
Then, it is compared with some other rule-based neural networks and with multilayer
perceptrons (MLP).

2.2  Knowledge-Based  Conceptual  Neural  Networks 

In the knowledge-based conceptual neural network (KBCNN) model, a set of
production rules can be mapped into a neural network structure, where data attributes or
variables are assigned to input units, target concepts or final hypotheses are assigned
to output units, and intermediate concepts or hypotheses are assigned to hidden
units. Then, the rules determine the connections of the units and the weights of the
connections [Fu, 1993].

Each rule in the KBCNN has a premise (antecedent) consisting of one or several
conditions and a conclusion (consequent). The standard rule format is

If $p_1$, $p_2$, ..., and $p_n$, then $c$

where $p_i$ is a condition and $c$ is the consequent. The conjunction of the $p_i$'s
constitutes the premise part of the rule. Rules with the disjunction operator can be
split into a few equivalent rules in this format. The corresponding network structure
of the standard rule is shown in Figure 2.1. Each rule is assigned a conjunction unit
which is used to combine the conditions in the premise. But if there is only one
condition in the premise, then the conjunction unit is not necessary. When two or more
rules have the same consequent, the conditions of different rules do not interact
improperly because a conjunction unit is created for each rule [Towell et al., 1990].

Figure 2.1. Network connections from the standard rule in the KBCNN

The initial weights of the connections can be determined in the following way.
Assume there are m positive conditions and n negated conditions in the premise of the
rule. The connection weights between the positive condition units and the
conjunction unit are set to 1/m, and the connection weights between the negated condition
units and the conjunction unit are initialized to -1/n. In order to differentiate
these weights from small random weights, the absolute value of these weights must
be greater than a predefined number, e.g., 0.3. The bias (negation of the threshold)
is -0.2 for the conjunction unit and 0 for the disjunction unit. The connection weight
between the conjunction unit and the consequent unit represents the rule strength of
this rule and should be a number in the range [0, 1] [Fu, 1993]. Finally,
the network is perturbed by adding small random numbers to all the weights and
biases in order to break the symmetry among units [Rumelhart et al., 1986].
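
The weight assignment just described can be sketched for a single multi-condition rule as follows. This is only an illustration under stated assumptions: the function name, the returned dictionary layout, the size of the random perturbation, and in particular the use of the predefined number as a lower bound on the weight magnitude are choices made here, since the text does not say how weights smaller than that number are adjusted.

```python
import random


def init_rule_weights(m, n, rule_strength, min_magnitude=0.3, noise=0.01, seed=0):
    """Initial weights for one rule with m positive and n negated conditions."""
    if m + n < 2:
        raise ValueError("a conjunction unit is only used for two or more conditions")
    if not 0.0 <= rule_strength <= 1.0:
        raise ValueError("rule strength should lie in [0, 1]")
    rng = random.Random(seed)

    def perturb(value):
        # Small random perturbation; the range [-noise, noise] is an assumption.
        return value + rng.uniform(-noise, noise)

    # 1/m and -1/n, with the magnitude floored at min_magnitude (assumption).
    positive = max(1.0 / m, min_magnitude) if m else 0.0
    negated = -max(1.0 / n, min_magnitude) if n else 0.0
    return {
        "positive": [perturb(positive) for _ in range(m)],
        "negated": [perturb(negated) for _ in range(n)],
        "conjunction_bias": perturb(-0.2),
        "rule_strength": perturb(rule_strength),
    }
```

For a rule with two positive conditions, one negated condition, and rule strength 0.8, the unperturbed values (`noise=0.0`) are 0.5 for each positive connection, -1.0 for the negated connection, -0.2 for the conjunction bias, and 0.8 for the connection to the consequent unit.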

Extra  network  units  and  connections  can  be  added  to  a KBCNN  to  learn  potential 
rules  inherent  in  the  data.  These  extra  units  and  connections  are  referred  to  as 
potential  units  and  potential  connections.  This  is  important  especially  when  the 
domain  knowledge  is  incomplete.  The  weights  of  potential  connections  are  set  to 
small  random  numbers. 

Two activation functions can be used for the network units in the KBCNN. The
first one is based on the certainty-factor (CF) model:

$$f(x_1, x_2, \ldots, y_1, y_2, \ldots) = f^{+}(x_1, x_2, \ldots) + f^{-}(y_1, y_2, \ldots) \tag{2.1}$$

where

$$f^{+}(x_1, x_2, \ldots) = 1 - \prod_{i} (1 - x_i) \tag{2.2}$$

$$f^{-}(y_1, y_2, \ldots) = -1 + \prod_{j} (1 + y_j) \tag{2.3}$$

and the $x_i$'s are positive weighted inputs and the $y_j$'s are negative ones. The
above-mentioned initial weight assignment is based on this CF-based activation function.
When this function is used, all the weights and activations are confined in the range
of [0, 1].
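
A minimal Python sketch of Eqs. 2.1 to 2.3 is given below. The function name is an assumption, the weighted inputs are assumed to have magnitude at most 1, and the treatment of the unit bias is not covered here because the text does not specify how it enters the CF-based function.

```python
from math import prod


def cf_activation(weighted_inputs):
    """CF-based activation of Eq. 2.1 applied to a list of weighted inputs."""
    positives = [v for v in weighted_inputs if v > 0]
    negatives = [v for v in weighted_inputs if v < 0]
    # Eq. 2.2: f+ = 1 - prod(1 - x_i); an empty product is 1, so f+ = 0.
    f_plus = 1.0 - prod(1.0 - x for x in positives)
    # Eq. 2.3: f- = -1 + prod(1 + y_j); an empty product is 1, so f- = 0.
    f_minus = -1.0 + prod(1.0 + y for y in negatives)
    return f_plus + f_minus
```

Two positive weighted inputs of 0.5 combine to 0.75, which shows how supporting evidence accumulates without exceeding 1, while a positive and a negative input of equal magnitude cancel.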

Table 2.1. A set of artificial rules

Premise              Conclusion    Rule strength
A, B, and not C      then G        0.8
D                    then G        0.65
E and F              then H        0.9
G and not H          then I        0.75

Figure 2.2. A KBCNN mapped from the rules in Table 2.1

The other activation function used in the KBCNN is the sigmoid function, which
was introduced in Chapter 1. The KBCNN is a feedforward network with
differentiable activation functions. Thus, the standard back-propagation procedure
[Rumelhart et al., 1986] can be used to train the net.

Figure 2.2 shows a KBCNN transformed from the set of artificial rules in Table 2.1.
The number associated with each rule is the rule strength. Circles denote
disjunction units and squares denote conjunction units. The number associated with each
connection is its weight and the number with each unit is its bias.

The way to construct a KBCNN from a set of production rules has been illustrated
in this section. Other work on rule-based neural networks will be reviewed and
compared in the next section.
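
The construction illustrated in this section can be sketched in Python for the rules of Table 2.1. The data layout, the unit names such as `conj1`, and the handling of single-condition rules (a direct connection whose weight is the rule strength) are assumptions made for illustration; biases, the minimum weight magnitude, and the random perturbation are omitted, and the result is not guaranteed to match Figure 2.2 in every detail.

```python
# Rules of Table 2.1 as (conditions, consequent, rule strength).
RULES = [
    (["A", "B", "not C"], "G", 0.8),
    (["D"], "G", 0.65),
    (["E", "F"], "H", 0.9),
    (["G", "not H"], "I", 0.75),
]


def parse(condition):
    """Split a condition into (unit name, is_negated)."""
    if condition.startswith("not "):
        return condition[4:], True
    return condition, False


def build_structure(rules):
    """Map rules to units and weighted connections (source, target, weight)."""
    consequents = {c for _, c, _ in rules}
    used = {parse(cond)[0] for conds, _, _ in rules for cond in conds}
    units = {
        "input": sorted(used - consequents),    # data attributes
        "hidden": sorted(used & consequents),   # intermediate concepts
        "output": sorted(consequents - used),   # final hypotheses
        "conjunction": [],
    }
    connections = []
    for index, (conds, consequent, strength) in enumerate(rules, start=1):
        parsed = [parse(c) for c in conds]
        if len(parsed) == 1:
            # Assumption: no conjunction unit; the direct link carries the strength.
            name, negated = parsed[0]
            connections.append((name, consequent, -strength if negated else strength))
            continue
        conj = f"conj{index}"
        units["conjunction"].append(conj)
        m = sum(1 for _, negated in parsed if not negated)
        n = len(parsed) - m
        for name, negated in parsed:
            connections.append((name, conj, -1.0 / n if negated else 1.0 / m))
        connections.append((conj, consequent, strength))
    return units, connections
```

Under these assumptions, `build_structure(RULES)` yields six input units (A to F), two hidden units (G and H), one output unit (I), and three conjunction units, one for each rule with more than one condition.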

2.3  Comparison  of  Different  RBNNs

In addition to the KBCNN, three other methods, by Towell et al. [1990], Goodman
et al. [1992], and Lacher et al. [1992], are closely related work on rule-based neural
networks. In this section, they are briefly reviewed and compared with the KBCNN.

In  Towell  et  al.  [1990],  the  knowledge-based  artificial  neural  network  (KB ANN) 
was  introduced.  The  input,  hidden,  and  output  units  of  the  KBANN  are  determined 
in  the  same  manner  as  the  KBCNN  except  that  the  conjunction  units  are  created 
only  when  it  is  necessary.  The  conjunction  units  are  created  only  when  two  or  more 
rules  have  the  same  consequent.  Recall  that  in  the  KBCNN  one  conjunction  unit 
is  introduced  for  each  rule,  except  for  single-condition  rules.  The  disadvantage  of 
missing  conjunction  units  for  rules  with  multiple  conditions  is  that  when  potential 
units  and  connections  are  added  to  the  net,  the  original  configuration  might  need  to 
be  changed  because  of  creation  of  conjunction  units. 

The weights of connections from positive and negative conditions are set to $w$
and $-w$, respectively. The bias of the unit to which the condition units are linked (a
consequent unit or a conjunction unit) is set to $n w - \phi$, where $n$ is the number of
positive conditions and $\phi$ is a parameter chosen so that the activation of the unit is
above 0.9 when the premise is satisfied, and is below 0.1 otherwise. The weights from
conjunction units to consequent units are set to $w$. The bias of a consequent
unit which receives activations from conjunction units instead of condition units is
set to $w - \phi$ such that this consequent unit is active when any of the conjunction
units is active. In Towell et al. [1990], $w = 3.0$ and $\phi = 2.3$ are suggested.

The rules in Table 2.1 are used to construct the KBANN shown in Figure 2.3.
Notice that the rule strengths are not considered. The sigmoid activation
function and the back-propagation procedure are used in the KBANN. It is reported
in Towell et al. [1990] that the KBANN can outperform the multilayer perceptron.

Lacher et al. [1992] introduced the expert network, which is an event-driven,
acyclic network. The network units in an expert network are classified into regular
nodes and operation nodes. The regular nodes correspond to conditions or consequents.
The operation nodes correspond to two operators (AND and NOT). Each
condition node is linked to the operation node to which it is related when an
AND or NOT operation is involved in the rule; otherwise, it is directly linked to the
consequent node. Weights of the incoming connections to operation nodes are set to
1; they are hard-wired and are referred to as the hard weights. All other weights
are set to the rule strengths (certainty factors) of their corresponding rules. These
weights are referred to as the soft weights, which are adjustable during learning. No
biases are assigned to any of the nodes in an expert network.

The activation function of the nodes in the expert network consists of two functions:
the combining function and the output function. The activation function of
the regular node is also CF-based, but it follows the CF model more rigorously than
the one used in the KBCNN (Eq. 2.1). The combining function of the regular node
is

$$f_{cf}(x_1, x_2, \ldots, y_1, y_2, \ldots) = \frac{f^{+}(x_1, x_2, \ldots) + f^{-}(y_1, y_2, \ldots)}{1 - \min\{|f^{+}(x_1, x_2, \ldots)|,\ |f^{-}(y_1, y_2, \ldots)|\}} \qquad (2.4)$$

where $f^{+}()$ and $f^{-}()$ are the same functions as in Eqs. 2.2 and 2.3. The divisor is
assumed to be nonzero; otherwise, it is set to one. Let $z = f_{cf}(x_1, x_2, \ldots, y_1, y_2, \ldots)$;
then the output function can be expressed by

$$\text{output} = \begin{cases} 0 & \text{if } z < 0.2 \\ z & \text{otherwise} \end{cases} \qquad (2.5)$$


The combining function of the conjunction (AND) node is the minimum operator,
whereas the output function is the same as for the regular node. The negation
(NOT) node has only a single input, so it does not need a combining function. The
output function of the negation node is given by

$$f_{neg}(z) = \begin{cases} 1 & \text{if } z < 0.2 \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)$$

where $z$ is the input. Back-propagation is also used here to train the soft weights.

Figure 2.4. An expert network mapped from the rules in Table 2.1

An expert network is constructed from the rules in Table 2.1 and is shown in
Figure 2.4. Triangles denote negation nodes and squares denote conjunction nodes.
The numbers in boldface are hard weights; the rest are soft weights.
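The node functions of the expert network described above can be sketched as follows. This is a minimal illustration, not Lacher et al.'s implementation: the function names and the `cutoff` parameter are hypothetical, and the regular-node combining function is read as the sum of the positive and negative combinations divided by one minus the smaller of their magnitudes.

```python
from math import prod


def f_plus(xs):
    """Positive evidence combination (Eq. 2.2)."""
    return 1.0 - prod(1.0 - x for x in xs)


def f_minus(ys):
    """Negative evidence combination (Eq. 2.3)."""
    return -1.0 + prod(1.0 + y for y in ys)


def combine_regular(xs, ys):
    """Combining function of a regular node."""
    pos, neg = f_plus(xs), f_minus(ys)
    divisor = 1.0 - min(abs(pos), abs(neg))
    if divisor == 0.0:  # the text sets a zero divisor to one
        divisor = 1.0
    return (pos + neg) / divisor


def combine_and(inputs):
    """Combining function of a conjunction (AND) node: the minimum."""
    return min(inputs)


def output_regular(z, cutoff=0.2):
    """Output function of regular and AND nodes (Eq. 2.5)."""
    return 0.0 if z < cutoff else z


def output_negation(z, cutoff=0.2):
    """Output function of a negation (NOT) node (Eq. 2.6)."""
    return 1.0 if z < cutoff else 0.0
```

With one positive input of 0.8 and one negative input of -0.5, the combined value is 0.3 / 0.5 = 0.6, which passes the 0.2 cutoff unchanged.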

Goodman et al. [1992] introduced another form of rule-based neural network. Like
the previous approaches, it explicitly encodes knowledge in the form of production rules
into a network structure. But it uses information theory to estimate the weights
instead of using the back-propagation training procedure. The network behaves as a
parallel Bayesian classifier, but it can estimate posterior probabilities of the output
variables.

In this work no intermediate concepts are allowed. That is, all rule consequents
are the output classes and all conditions are the input variables. One conjunction
node is created for each rule, so a one-hidden-layer network is constructed from a
set of rules. The conjunction node here acts like an AND gate: when a conjunction
is detected, it outputs 1; otherwise, it outputs 0.

The  weights  of  the  connections  between  the  nodes  corresponding  to  the  conditions 
and  the  conjunction  node  are  set  to  unity.  The  weight  of  the  connection  between  the 
conjunction  node  and  the  node  corresponding  to  the  consequent  is  the  rule  strength. 
The  rule  strength  here  is  the  conditional  probability  of  the  consequent  given  the 
premise  instead  of  the  certainty  factor  as  in  the  KBCNN  and  the  expert  network. 

This approach cannot map the rules in Table 2.1 directly into a neural network
since two intermediate concepts exist in the rule set. Therefore, the comparison of
network topology with other methods is skipped here. Goodman et al. [1992] show
that their rule-based neural network performs comparably to multilayer
perceptrons.

Three  different  approaches  to  rule-based  neural  networks  are  briefly  described  in 
this  section.  The  comparison  with  the  KBCNN  is  emphasized.  In  the  next  section, 
the  KBCNN  is  compared  with  the  MLP. 

2.4  Comparison  of  the  KBCNN  and  the  MLP 

When constructing a multilayer feedforward neural network, the number of hidden
layers and the number of hidden units are usually decided empirically. The units
between adjacent layers are fully connected. Then, the weights of the connections
are initialized to small random numbers. When the network is large, the training
procedure can be very time consuming.

The KBCNN (the RBNN, in general) maps the existing domain knowledge, in
the format of production rules, directly into a neural network structure, and the
connection weights are set according to the rules. The domain knowledge is used
to constrain the network topology and initialize related connection weights. This
utilization of knowledge can produce quite good results when the knowledge is correct.
Adding extra hidden units and connections can relax the constraint to some extent.
However, it is a purely heuristic action and its effect is uncertain.

The KBCNN provides a bridge between rule-based systems and neural networks.
It has far fewer connections than the fully-connected MLP, but with comparable or
even better performance than the MLP. The advantage is especially significant when
training data are incomplete. Because of the smaller network size and the existing initial
knowledge, the KBCNN needs much less training effort.

2.5  Summary 

How to incorporate human knowledge into the neural network is the main focus
of this chapter. Several rule-based neural networks are introduced, which use human
knowledge in the form of production rules to determine the neural net configuration
and initial connection weights. In a multilayer feedforward network, each network
unit is usually linked to all the units in the adjacent layers. This full connectivity is
simple to set up, but with a large number of connections, the training and processing
cost is high. Furthermore, determination of the number of hidden layers and the number of
units for each hidden layer is always an issue. Rule-based neural networks provide
solutions to these two problems.

Knowledge-based neural networks exploit both human knowledge and the knowledge
inherent in the data. When the human knowledge is strong, the corresponding
neural network can perform well. When it is weak, extra units and connections
are necessary for learning potential knowledge in the data.

Upon  further  examination,  it  was  found  that  the  rule-based  neural  network  lacks 
two  aspects  of  human  knowledge.  First,  it  does  not  consider  the  fuzzy  nature  of 
human  knowledge.  Although  we  can  always  use  fuzzified  inputs  and  fuzzified  desired 
output  in  neural  network  training,  it  is  desirable  that  fuzzy  membership  functions 
are  part  of  the  rule-based  neural  network.  Secondly,  human  knowledge  about  context 
is  hard  to  encode  in  the  rule  format.  We  can  never  thoroughly  consider  all  possible 
situations  about  context.  Even  if  we  can,  the  number  of  rules  would  be  huge.  These 
two  issues  are  discussed  in  the  following  two  chapters. 


CHAPTER  3 

FUZZY NEURAL NETWORKS

3.1 Introduction

Unlike  computers,  people  are  not  always  precise.  People  think  and  reason  using 
linguistic  terms  such  as  “high”  and  “warm,”  rather  than  precise  numerical  terms 
such  as  “30000  feet”  and  “75  degrees.”  Furthermore,  people  can  make  decisions 
other  than  absolute  “yes/no.”  Fuzzy  logic  provides  a computer  with  the  capability 
to  make  the  same  kind  of  decision  that  people  do.  Fuzzy  logic  fits  best  when  the 
problem  is  concerned  with  continuous  phenomena  which  are  not  easily  broken  down 
into  discrete  segments,  and  when  the  problem  is  difficult  to  model  by  mathematics 
[Cox,  1992]. 

Neural networks are capable of model-free estimation since it has been proved
that a multilayer feedforward network is a universal mapper, i.e., it can approximate
any continuous function to any desired degree of accuracy [Hecht-Nielsen, 1987].

Both neural networks and fuzzy systems provide a means to properly “model” a
problem without using complex mathematics. The relationship between neural and
fuzzy systems is discussed in Kosko [1992]. It is desirable to exploit the
representation power of fuzzy logic and the learning power of neural networks at the
same time. This motivates the research on fuzzy neural networks.

Fuzzy logic has been applied to expert systems. It would be useful to take the
fuzzy nature of human knowledge into consideration. With fuzzy logic, rules are
written in terms of imprecise ideas of what constitutes the states of a variable. Fuzzy
rules are more natural, and also more expressive. Fuzzy expert systems have had
great success in recent years.

A fuzzification  method  is  proposed  for  the  neural  network  to  handle  fuzziness.  A 
fuzzy  neural  network  can  be  constructed  by  building  in  a fuzzification  layer.  The 
fuzzification  layer  is  designed  for  both  general  feedforward  and  rule-based  neural 
networks. 

In  this  chapter,  we  first  review  concepts  of  fuzzy  sets  and  fuzzy  membership 
functions  and  related  work,  and  then  present  the  details  of  the  proposed  fuzzification 
method. 

3.2  Fuzzy  Sets  and  Fuzzy  Membership  Functions 

Fuzzy logic works through the use of fuzzy sets. Traditional sets impose rigid
membership requirements upon the objects within a set. An object is either completely
in the set, or it is not in the set at all. Another way of saying this is that an
object is a member of the set to a degree of 1 or 0. For example, the set of “fast” cars
could be defined as all cars with a speed of 70 miles per hour or higher. We would
say a car with a speed of 70 miles per hour is “fast,” but a car with a speed of 65
miles per hour is “not fast.” The traditional set classifies the speed of a car as either
completely fast, or not fast at all; there is no middle ground. But in this example,
most people would say a car with a speed of 65 miles per hour is “somewhat” fast.

Fuzzy sets have more flexible membership requirements which allow for partial
membership in a set. The degree to which an object is a member of a fuzzy set can
be any value between 0 and 1, rather than strictly 0 or 1 as in a traditional set. With
the fuzzy set, there is a gradual transition from membership to nonmembership. For
example, a car with a speed of 70 miles per hour is fast to a degree of 0.7, and a car
with a speed of 65 miles per hour is fast to a degree of 0.5.
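The contrast between the two kinds of set can be made concrete with a small sketch. The crisp version follows the 70 miles-per-hour cut-off from the example; the fuzzy version is only a hypothetical linear ramp chosen so that it passes through the two degrees quoted in the text (0.5 at 65 mph and 0.7 at 70 mph). The text does not specify the shape of the membership function, so the ramp is an assumption.

```python
def fast_crisp(speed_mph):
    """Traditional set: a car is either fast (1) or not fast (0)."""
    return 1.0 if speed_mph >= 70.0 else 0.0


def fast_fuzzy(speed_mph):
    """Fuzzy set: degree of membership in 'fast', between 0 and 1.

    Hypothetical linear ramp through (65 mph, 0.5) and (70 mph, 0.7),
    clipped to the interval [0, 1].
    """
    degree = 0.5 + 0.04 * (speed_mph - 65.0)
    return min(1.0, max(0.0, degree))
```

Under the crisp definition a 65 mph car drops abruptly to degree 0, while the fuzzy definition changes gradually with speed.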

One of the major tasks in implementing fuzziness is the choice of the term
set (or the membership function). The term set can be decided through subjective
evaluation. Or it can be constructed by some ad hoc methods, e.g., the $\pi$ membership
function used in Pal and Mitra [1992]. Or it can be converted from probability curves
or frequency histograms. Or it can be determined through learning and adaptation.
An adjustable membership function is the most desirable way to handle the fuzzification
of the inputs. But it is always important to remember that membership values
should not be regarded as probabilities.

Triangles  and  trapezoids  are  commonly  used  to  shape  fuzzy  membership  functions 
by  human  experts.  Figure  3.1  shows  an  example  of  so-determined  fuzzy  membership 
functions.  These  functions  can  be  approximated  by  sigmoid  functions  with  different 
thresholds  and  gains.  We  will  show  how  this  can  be  accomplished  in  Section  3.4. 

3.3  A Brief  Review  of  Fuzzy  Neural  Networks 

A great number of fuzzy neural networks have been proposed in recent years.
Most of them try to implement fuzzy logic operations, e.g., AND, OR, and NOT, in
a neural-network-like structure by designing special neurons and/or network topology.

Figure 3.1. An example of fuzzy membership functions

In fuzzy logic, the AND and OR operators are realized as min and max or product
and sum. The NOT operator is realized as $\neg(r) = 1 - r$, where $r$ is a degree of
membership.
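A minimal sketch of these operators on membership degrees is given below. The min/max pair and the complement follow the text directly. For the product/sum pair, the sketch assumes that "sum" means the probabilistic sum $a + b - ab$, which keeps the result inside [0, 1]; other readings (for example a bounded sum) are possible. The function names are hypothetical.

```python
def and_min(a, b):
    """Fuzzy AND realized as the minimum."""
    return min(a, b)


def or_max(a, b):
    """Fuzzy OR realized as the maximum."""
    return max(a, b)


def and_product(a, b):
    """Fuzzy AND realized as the product."""
    return a * b


def or_prob_sum(a, b):
    """Fuzzy OR realized as a sum; assumed here to be the probabilistic sum."""
    return a + b - a * b


def fuzzy_not(r):
    """Fuzzy NOT: the complement of a degree of membership."""
    return 1.0 - r
```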

Yager  [1992]  proposed  a special  neuron  known  as  the  OWA  (ordered  weighted 
averaging)  neuron  for  multi-criteria  aggregation.  Hsu  et  al.  [1992]  designed  some 
subnetwork  structures  to  realize  the  fuzzy  operators.  Yuan  et  al.  [1992]  used  two 
differentiable  functions  to  approximate  min  and  max.  Pedrycz  and  Rocha  [1993] 
implemented  OR  and  AND  neurons  with  the  use  of  triangular  norms.  All  of  these 
specially  designed  networks  employ  a back-propagation-type  training  algorithm. 

The fuzzy ART [Carpenter et al., 1991] is the fuzzy version of the ART 1 (adaptive
resonance theory) networks [Carpenter and Grossberg, 1987]. The ART 1 network is
based on an unsupervised competitive learning scheme, which is very different from
a back-propagation network. The fuzzy min-max neural network [Simpson, 1992,
1993] is also inspired by the ART networks. Each fuzzy set is the aggregate (union)
of hyperboxes which are constructed from a min point and a max point.

Pal and Mitra [1992] used the multilayer perceptron to learn “fuzzy” mappings
by back-propagation with a weight decay term. An input block consisting of the $\pi$
membership function was placed in front of the MLP to fuzzify the input features
presented in quantitative and/or linguistic forms. The desired outputs were fuzzy
class membership values.

Ishibuchi  et  al.  [1993]  also  proposed  a neural  network  architecture  to  learn  fuzzy 
nonlinear  mappings.  Fuzzy  if-then  rules  were  rewritten  as  “fuzzy  data”  which  were 
used  along  with  measured  data  as  the  training  data.  Learning  algorithms  based  on 
back-propagation  were  derived  for  “interval”  input  vectors  paired  with  normal  output 
vectors  or  with  “interval”  output  vectors. 

Horikawa  et  al.  [1992]  developed  three  types  of  fuzzy  models  using  fuzzy  neural 
networks.  The  neural  networks  can  learn  fuzzy  rules,  which  describe  the  nonlinear 
system,  from  the  data  and  tune  the  membership  functions  at  the  same  time. 

3.4 A Fuzzification Layer for Neural Networks

The multilayer perceptron is a universal mapper which can approximate any continuous
function. It can also approximate associations of fuzzy input-output pairs.
Pal and Mitra [1992] show good results in learning fuzzy mappings by multilayer
perceptrons. Thus, it seems unnecessary to strictly follow fuzzy operations, e.g., min
and max, when constructing a fuzzy neural network. It is justified to incorporate
fuzziness into neural networks by superimposing a fuzzification layer, which fuzzifies
continuous inputs, on the nets.

Another reason to add a fuzzification layer to neural networks is that by so doing
the fuzzy membership functions can be refined during the training process. Horikawa
et al. [1992] used sigmoid functions to form fuzzy membership functions in neural
networks. A fuzzification layer can be easily constructed using neurons with sigmoid
activation functions and specially designed interconnections.

This fuzzification layer can be superimposed on any neural network (feedforward
or recurrent) by placing it between the input layer and the rest of the net. It is especially
suitable for rule-based neural networks since both of them involve constraints
on network connections and their corresponding weights. A “fuzzy” rule-based neural
network can easily be built under the original RBNN framework with the fuzzification
layer.

The inputs to this fuzzy neural network model are continuous data from measurement.
The outputs can be crisp or fuzzy numbers. A defuzzification process
is necessary when the outputs are fuzzy numbers. There are several ways to defuzzify
fuzzy results to crisp values. The two most commonly used techniques are the
maximum and the centroid methods [Kosko, 1992].

In this section, we will describe how to form bell-shaped functions by sigmoids
and how to construct the fuzzification layer using sigmoid neurons. Two ways of
weight assignment and their respective back-propagation-type training algorithms
are presented and formulated. Here, only the combination of the fuzzification layer with
feedforward neural networks is considered.

3.4.1 Bell-Shaped Functions Constructed by Paired Sigmoids

A pair of sigmoids is capable of forming a bell-shaped function with a single
argument. This can be done by finding the difference between two sigmoids with
different thresholds and gains. Consider a typical neuron with the sigmoid activation:

$$x_j = f_s\Big(\beta_j \Big(\sum_i w_{i,j}\, x_i - \theta_j\Big)\Big) \qquad (3.1)$$

In this equation, $f_s$ is the sigmoid function in Eq. 1.1, $j$ denotes the current unit, $i$
denotes the units with outgoing connections to unit $j$, $x_i$ and $x_j$ are the activations,
$w_{i,j}$ is the weight from unit $i$ to unit $j$, $\theta_j$ is the threshold of unit $j$, and $\beta_j$ is the gain
which controls the steepness of the sigmoid function. Since the bell-shaped function
has only a single argument (input), with $w_{i,j} = 1$, Eq. 3.1 is reduced to

$$x_j = f_s(\beta_j (x_i - \theta_j)) \qquad (3.2)$$

The same input is fed into a pair of such sigmoid neurons. Figure 3.2 illustrates the
formation of a bell-shaped function by two sigmoids.

Figure 3.2. A subnetwork with a pair of sigmoid neurons and its corresponding
bell-shaped function. In (b), the solid line (y3) is the difference between the line with
asterisks (y1) and the dashed line (y2).

In implementation, we could set the weights of the connections of the paired
neurons to any other neuron to the same absolute value, but with different signs. By
changing the thresholds and the gains of the paired neurons, different bell-shaped
functions can be obtained. The thresholds determine the locations of the two sides,
and the gains control the steepness of the sides. The formed “bell-shaped” function
could be unsymmetrical, and it can approximate any triangles and trapezoids which
are commonly used for fuzzy membership functions.
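The paired-sigmoid construction can be sketched numerically as below. The sketch assumes that the sigmoid of Eq. 1.1 is the standard logistic function; the function names and the parameter values in the usage note are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic sigmoid, assumed to be the f_s of Eq. 1.1."""
    return 1.0 / (1.0 + math.exp(-net))


def paired_bell(x, theta_left, beta_left, theta_right, beta_right):
    """Difference of two sigmoid neurons that share the same input x.

    The first neuron (threshold theta_left, gain beta_left) forms the
    rising side; subtracting the second neuron (theta_right, beta_right)
    forms the falling side.
    """
    rising = sigmoid(beta_left * (x - theta_left))
    subtracted = sigmoid(beta_right * (x - theta_right))
    return rising - subtracted
```

With, say, thresholds 2 and 6 and gains 4 and 2, the function is close to 1 around x = 4 and close to 0 far from the interval, and its left side is steeper than its right side, so the resulting bell is not symmetric.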

Radial basis functions are also popular activation functions for neural networks
[Moody and Darken, 1988, 1989]. The most commonly used radial basis function (RBF) is
the Gaussian function, which is also a bell-shaped function. But the learning of
the RBF neural network involves self-organization, and is different from the
back-propagation procedure. Also, because the RBF is symmetric, it cannot freely
approximate arbitrary trapezoids.

3.4.2  Fuzzification 

In the previous subsection, it is shown that paired sigmoid neurons can form
a “bell-shaped” function, which can approximate any trapezoids and triangles. In
Figure 3.3(a), sample membership functions for minute tokens of the EEG alpha wave are
shown. There are four membership functions for the alpha-wave token: SCARCE,
LOW, MEDIUM, and HIGH, which correspond to different levels of the alpha activity
during one minute. The fuzzy membership functions are in the shape of trapezoids.
In Figure 3.3(b), sigmoid functions are used to approximate the fuzzy membership
functions.

Figure 3.3. Membership functions for alpha tokens (both panels show the levels
SCARCE, LOW, MEDIUM, and HIGH)

It is illustrated in Figure 3.4 how the fuzzy encoding can be achieved in a neural
network structure by using sigmoid functions with an adjustable gain (assuming there
is only one input variable). The fuzzification layer here corresponds to the membership
functions in Figure 3.3. In top-down order, the first neuron corresponds
to the SCARCE membership function, the second and third ones correspond to
LOW, the fourth and fifth ones correspond to MEDIUM, and the sixth one corresponds
to HIGH. For the LOW and the MEDIUM membership functions, the
combination of the outputs of the corresponding paired neurons produces a bell-shaped
membership function. Notice that the gain (steepness) of the first neuron is always
negative.

Figure 3.4. A fuzzification layer for neural networks

A fuzzification layer can be constructed between the input and the hidden layers
to encode continuous inputs. For an input variable with $n$ fuzziness levels, $2(n - 1)$
neurons with the sigmoid activation function are needed in the fuzzification layer. The
threshold and the gain of each neuron in the fuzzification layer are adjustable. In
this way, optimal shapes of the fuzzy membership functions can be obtained through
training.
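One way to lay out such a layer for a single input variable is sketched below. It is an illustration under stated assumptions rather than the configuration used in this work: adjacent levels are assumed to share a crossover point, all neurons start with the same gain magnitude, and the crossover values (10, 30, 50), the gain 0.5, and the function names are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic sigmoid, assumed to be the f_s of Eq. 1.1."""
    return 1.0 / (1.0 + math.exp(-net))


def build_fuzzification_layer(crossovers, gain):
    """Return (threshold, gain) pairs for the 2(n - 1) fuzzification neurons.

    `crossovers` holds the n - 1 points where one level hands over to the next.
    """
    neurons = [(crossovers[0], -gain)]        # lowest level: negative gain
    for left, right in zip(crossovers, crossovers[1:]):
        neurons.append((left, gain))          # rising side of an interior level
        neurons.append((right, gain))         # subtracted to form its falling side
    neurons.append((crossovers[-1], gain))    # highest level: rising sigmoid
    return neurons


def memberships(x, neurons):
    """Membership degree of x in each level, from lowest to highest."""
    acts = [sigmoid(beta * (x - theta)) for theta, beta in neurons]
    degrees = [acts[0]]
    for k in range(1, len(acts) - 1, 2):      # interior levels use paired neurons
        degrees.append(acts[k] - acts[k + 1])
    degrees.append(acts[-1])
    return degrees


LEVELS = ["SCARCE", "LOW", "MEDIUM", "HIGH"]
layer = build_fuzzification_layer([10.0, 30.0, 50.0], gain=0.5)
degrees = memberships(20.0, layer)
```

With four levels the layer has six neurons, and an input of 20 falls mostly in LOW. Because neighbouring sigmoids in this particular layout share thresholds and gain magnitudes, the four degrees add up to 1 before training; nothing enforces that once thresholds and gains are adapted independently.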

3.4.3  Weight  Assignment  and  the  Training  Algorithm 

For a typical back-propagation neural network, the activation function of the
neurons is given by Eq. 3.1 with $\beta_j$ fixed to 1. The weight adaptation equations as
summarized in Lippmann [1987] are as follows (the momentum term is skipped):

$$w_{i,j}(t+1) = w_{i,j}(t) + \eta\, \delta_j\, x_i \qquad (3.3)$$

where $\eta$ is the learning rate, and $\delta_j$ is the error term. If $j$ is an output neuron, then

$$\delta_j = x_j (1 - x_j)(d_j - x_j) \qquad (3.4)$$

where $d_j$ is the desired output. If $j$ is a hidden neuron, then

$$\delta_j = x_j (1 - x_j) \sum_k \delta_k\, w_{j,k} \qquad (3.5)$$

where $k$ denotes all the neurons to which neuron $j$ has an outgoing connection. The
threshold is the negation of the bias. The bias is viewed as the weight from a bias
neuron with activation 1. Thus,

$$\theta_j(t+1) = \theta_j(t) - \eta\, \delta_j \qquad (3.6)$$
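The update rules of Eqs. 3.3 to 3.6 translate almost directly into code. The sketch below handles a single connection and a single neuron at a time; the function names are hypothetical and the momentum term is omitted, as in the text.

```python
def output_delta(x_j, d_j):
    """Error term of an output neuron (Eq. 3.4)."""
    return x_j * (1.0 - x_j) * (d_j - x_j)


def hidden_delta(x_j, deltas_k, weights_jk):
    """Error term of a hidden neuron (Eq. 3.5).

    deltas_k and weights_jk list, for every neuron k fed by neuron j,
    its error term and the weight of the connection from j to k.
    """
    back = sum(d * w for d, w in zip(deltas_k, weights_jk))
    return x_j * (1.0 - x_j) * back


def update_weight(w_ij, eta, delta_j, x_i):
    """Weight adaptation (Eq. 3.3)."""
    return w_ij + eta * delta_j * x_i


def update_threshold(theta_j, eta, delta_j):
    """Threshold adaptation (Eq. 3.6); the sign is opposite to a bias update."""
    return theta_j - eta * delta_j
```

For example, an output neuron with activation 0.5 and desired output 1 has an error term of 0.5 x 0.5 x 0.5 = 0.125.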


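To optimize the cost function $E$ as in Eq. 1.2 by gradient descent, the parameter
$\lambda_j$ in a neural network is adjusted by

$$\Delta\lambda_j = -\eta \frac{\partial E}{\partial \lambda_j} \qquad (3.7)$$

Using the chain rule,

$$\frac{\partial E}{\partial \lambda_j} = \frac{\partial E}{\partial net_j} \frac{\partial net_j}{\partial \lambda_j} \qquad (3.8)$$

and, with the error term defined as

$$\delta_j = -\frac{\partial E}{\partial net_j} \qquad (3.9)$$

Eq. 3.7 is changed to

$$\Delta\lambda_j = \eta\, \delta_j \frac{\partial net_j}{\partial \lambda_j} \qquad (3.10)$$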


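The activation function of the neurons in the fuzzification layer is Eq. 3.2, where
$net_j$ is equal to $\beta_j (x_i - \theta_j)$. The partial derivatives of $net_j$ with respect to the two
parameters $\theta_j$ and $\beta_j$ are

$$\frac{\partial net_j}{\partial \theta_j} = -\beta_j \qquad (3.11)$$

$$\frac{\partial net_j}{\partial \beta_j} = x_i - \theta_j \qquad (3.12)$$

Therefore, the following equations for adaptation of the threshold and the gain of the
fuzzification neuron can be obtained.

$$\Delta\theta_j(t+1) = -\eta\, \delta_j\, \beta_j(t) \qquad (3.13)$$

$$\Delta\beta_j(t+1) = \eta\, \delta_j\, (x_i - \theta_j(t)) \qquad (3.14)$$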
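The threshold and gain updates, read here as $\Delta\theta_j = -\eta\,\delta_j\,\beta_j$ and $\Delta\beta_j = \eta\,\delta_j\,(x_i - \theta_j)$, can be checked on a toy case. The sketch assumes a logistic sigmoid and, purely for illustration, treats one fuzzification neuron as if it were an output neuron with squared error $E = \frac{1}{2}(d_j - x_j)^2$, so that its error term is given by Eq. 3.4. All names and numeric values are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic sigmoid, assumed to be the f_s of Eq. 1.1."""
    return 1.0 / (1.0 + math.exp(-net))


def forward(x_i, theta, beta):
    """Fuzzification neuron of Eq. 3.2."""
    return sigmoid(beta * (x_i - theta))


def squared_error(x_i, theta, beta, target):
    """Toy cost: half the squared difference to the desired output."""
    return 0.5 * (target - forward(x_i, theta, beta)) ** 2


def fuzzification_updates(x_i, theta, beta, delta, eta):
    """Threshold and gain increments (Eqs. 3.13 and 3.14)."""
    d_theta = -eta * delta * beta
    d_beta = eta * delta * (x_i - theta)
    return d_theta, d_beta


# Hypothetical values for one training step.
x_i, theta, beta, target, eta = 0.8, 0.5, 2.0, 1.0, 0.1
x_j = forward(x_i, theta, beta)
delta = x_j * (1.0 - x_j) * (target - x_j)   # error term as in Eq. 3.4
d_theta, d_beta = fuzzification_updates(x_i, theta, beta, delta, eta)
```

Both increments agree with a finite-difference estimate of $-\eta\,\partial E/\partial\theta_j$ and $-\eta\,\partial E/\partial\beta_j$, and applying them lowers the toy error.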

Some researchers have used neurons with an adjustable gain in multilayer feedforward
networks. They use this kind of neuron in the whole network with different
purposes in mind. Tawel [1989] intended to let the neuron learn like a synapse. He
concluded that the required number of training cycles is significantly decreased.
Kruschke and Movellan [1991] also claimed that adjustable gain parameters can speed
up learning and minimize the number of hidden layers with improved generalization.

To simplify the training of the threshold and the gain of a fuzzification node,
Eq. 3.2 is modified as

$$\begin{aligned} x_j &= f_s(\beta_j (x_i - \theta_j)) \\ &= f_s(\beta_j x_i - \beta_j \theta_j) \\ &= f_s(w_{i,j}\, x_i - \theta'_j) \qquad (3.15) \end{aligned}$$

where $w_{i,j} = \beta_j$ is the weight of the connection between input neuron $i$ and fuzzification
neuron $j$ (this weight was set to 1), and $\theta'_j = \beta_j \theta_j$ is the new threshold. The gain of the
neuron is now fixed to 1. So, the standard back-propagation procedure (the case of
a fixed gain) can be applied. This technique simplifies the training process, avoiding
Eq. 3.13 and Eq. 3.14. After training, the adapted fuzzy membership functions can
be recovered by finding $\beta_j = w_{i,j}$ and $\theta_j = \theta'_j / \beta_j$.
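The equivalence of the two parameterizations, and the recovery of the gain and threshold after training, can be sketched as follows. The sketch assumes a logistic sigmoid and writes the new threshold as the product of the gain and the original threshold; the function names are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic sigmoid, assumed to be the f_s of Eq. 1.1."""
    return 1.0 / (1.0 + math.exp(-net))


def gain_form(x_i, theta, beta):
    """Eq. 3.2: adjustable gain and threshold, input weight fixed to 1."""
    return sigmoid(beta * (x_i - theta))


def weight_form(x_i, w_ij, theta_new):
    """Eq. 3.15: gain fixed to 1, adjustable input weight and threshold."""
    return sigmoid(w_ij * x_i - theta_new)


def to_weight_form(theta, beta):
    """Map (threshold, gain) to (input weight, new threshold)."""
    return beta, beta * theta


def recover_gain_form(w_ij, theta_new):
    """Recover (threshold, gain) from the trained weight-form parameters."""
    if w_ij == 0.0:
        raise ValueError("a zero weight gives a flat function with no threshold")
    return theta_new / w_ij, w_ij
```

A neuron with threshold 30 and gain -0.5, for example, maps to an input weight of -0.5 and a new threshold of -15, and both forms produce the same activation for every input. The recovery step is undefined when the trained weight is exactly zero, which the sketch reports as an error.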

As mentioned in Subsection 3.4.1, we need to keep the weights of the outgoing
connections of paired neurons to any particular neuron at the same magnitude but
with opposite signs. For example, in Figure 3.4, $w_3$ should be equal to $-w_2$, and $w_5$
should be equal to $-w_4$. So the adaptation procedure of these weights needs a minor
change. Actually, the paired neurons can be viewed as one unit (let us call it unit
$l$) since they represent the same concept. The activation of this unit, which is the
difference between the activations of the paired neurons, is sent to another neuron
(indexed by $k$) through a weighted connection. In standard back-propagation, the
weight of this connection is adjusted by

$$\Delta w_{l,k} = \eta\, \delta_k\, z_l \qquad (3.16)$$

$$w_{l,k}(t+1) = w_{l,k}(t) + \Delta w_{l,k} \qquad (3.17)$$

where $z_l = x_j - x_{j+1}$ is the combined activation of paired neurons $j$ and $j+1$. We
need to control $w_{j,k}$ to be $w_{l,k}$, and $w_{j+1,k}$ to be $-w_{l,k}$.


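$$\Delta w_{l,k} = \eta\, \delta_k\, (x_j - x_{j+1}) \qquad (3.18)$$

$$\phantom{\Delta w_{l,k}} = \eta\, \delta_k\, x_j - \eta\, \delta_k\, x_{j+1} \qquad (3.19)$$

Thus, the adaptation of $w_{j,k}$ and $w_{j+1,k}$ should follow

$$\Delta w_{j,k} = \eta\, \delta_k\, x_j \qquad (3.20)$$

$$\Delta w_{j+1,k} = \eta\, \delta_k\, x_{j+1} \qquad (3.21)$$

$$w_{j,k}(t+1) = w_{j,k}(t) + (\Delta w_{j,k} - \Delta w_{j+1,k}) \qquad (3.22)$$

$$w_{j+1,k}(t+1) = w_{j+1,k}(t) - (\Delta w_{j,k} - \Delta w_{j+1,k}) \qquad (3.23)$$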
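The tied update of the two outgoing weights can be sketched as a single function. The name `tied_update` and the numeric values in the usage note are hypothetical; the point of the sketch is only that the two weights stay equal in magnitude and opposite in sign after every step.

```python
def tied_update(w_j, w_j1, x_j, x_j1, delta_k, eta):
    """Update the outgoing weights of paired neurons j and j+1 to neuron k.

    Each neuron's standard back-propagation increment is computed first
    (Eqs. 3.20 and 3.21); their difference is then added to one weight and
    subtracted from the other (Eqs. 3.22 and 3.23).
    """
    dw_j = eta * delta_k * x_j
    dw_j1 = eta * delta_k * x_j1
    step = dw_j - dw_j1
    return w_j + step, w_j1 - step
```

Starting from weights 0.4 and -0.4, activations 0.9 and 0.2, an error term of 0.5, and a learning rate of 0.1, the step is 0.035 and the new weights are 0.435 and -0.435.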

3.5 Summary

Fuzziness is one important aspect of human thinking and reasoning. In order to
incorporate fuzziness into the neural network, a fuzzification layer is built to serve as
the fuzzy membership functions. Sigmoids are used to approximate commonly used
membership functions, namely triangles and trapezoids. The fuzzy membership functions
residing in the fuzzification layer can be tuned through the adaptation of the thresholds
and the gains of the neurons. The fuzzification layer can be superimposed on
any neural network to fuzzify continuous inputs which are not easily discretized.

The fuzzification layer provides a simple way to incorporate fuzziness into
back-propagation neural networks. It can be added to both general feedforward networks
and rule-based neural networks. Fuzzy rules can be easily handled with the combination
of the fuzzification layer and the rule-based neural network. Only one minor
change is needed to use the standard back-propagation procedure to train a feedforward
net with a fuzzification layer: the connection weights from paired neurons to any
other neuron must be kept at the same magnitude but with opposite signs.
With this simplified weight assignment, the more bothersome training procedure for
the gain (steepness) and the threshold can be avoided.
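
The paired-weight constraint can be sketched in a few lines of Python. The function below follows Eqs. 3.20 to 3.23 as read from the text; the function and argument names are illustrative assumptions.

```python
def paired_update(w_j, w_j1, x_j, x_j1, delta_k, eta):
    """One update of the weights from paired neurons j and j+1 to neuron k.

    Assumes w_j1 == -w_j on entry; the update preserves that relation.
    """
    dw_j = eta * delta_k * x_j        # Eq. 3.20
    dw_j1 = eta * delta_k * x_j1      # Eq. 3.21
    step = dw_j - dw_j1               # eta * delta_k * (x_j - x_j1), Eq. 3.18
    return w_j + step, w_j1 - step    # Eqs. 3.22 and 3.23
```

Because both weights move by the same step in opposite directions, their sum stays at zero after every update, which is the "same magnitude, opposite sign" condition.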


CHAPTER  4 

NEURAL  MODELING  FOR  CONTEXT  ANALYSIS 
4.1  Introduction 

Context refers to the surroundings relevant to the event of interest. This relationship
of adjacency could be in time or in space. Context information is essential in
pattern recognition and other decision-making tasks. It is also an important aspect
of human knowledge. Information corresponding to only the current event can give
some clues for decision making. But quite often context information is needed to
make a better decision. For example, to decide if “double” in the sentence “I double
the parts of the father and the son” is an adjective, a noun, an adverb, or a verb, we
need to check “I” and “the parts” first.

In temporal information processing, neural networks with a short-term memory
structure are capable of learning context information. Short-term memory can be
extended by delays or by positive feedback in a neural network structure. However,
short-term memory in the form of delays has boundaries, so the context is restricted to
a fixed region (we call this fixed region a window). On the other hand, short-term
memory in the form of positive feedback can stretch its memory outside the window.
The gamma neural model [de Vries and Principe, 1992] is a special case of such
networks with well-specified and adaptable memory structures.

McClelland and Elman [1986] were the first to investigate context learning using
neural networks. Each unit in the TRACE model, developed by McClelland and
Elman to handle speech perception, stands for a hypothesis occurring at a particular
point in time. Perceptual processing continues on previous portions of the input
even while the current portion is fed into the system. These continuing interactions
allow the model to include subsequent context information in the process. This is a
version of short-term memory by delays. The prior information and the subsequent
information of the current point are referred to as the left context and the right
context, respectively. The main issue in this chapter is how to properly represent the
right context in the neural network.

A new  neural  model  with  an  adaptive  memory  structure  has  been  developed  for 
context  learning  and  processing.  The  neural  model  is  built  by  extending  the  gamma 
model  (a  neural  model  for  temporal  processing)  to  accommodate  anticausal  memory, 
hence  the  name  “extended”  gamma  model.  With  the  anticausal  memory  structure, 
the  extended  model  can  take  both  the  left  and  right  contexts  into  account. 

In  this  chapter,  we  first  review  neural  network  models  for  temporal  sequence  and 
information  processing  with  emphasis  on  different  forms  of  short-term  memory.  Then 
the  extended  gamma  model  is  described  in  detail. 

4.2  Neural  Models  for  Temporal  Processing 

There are several neural net architectures designed specifically for temporal
processing. These networks differ in the form of their short-term
memory. With a short-term memory structure, neural networks can store and
process temporal relationships among data. There are two major ways to extend the
short-term memory capability of a neural model. The first one is by explicit
delays. The second one is by positive feedback (or recurrent connections) [de Vries
and Principe, 1992]. With the combination of these two methods, different forms of
short-term memory can be produced. A general equation for all possible memory
structures can be


$$M(t+1) = s(M(t), I(t)) \tag{4.1}$$

where $M(t)$ represents the memory at time $t$, $I(t)$ is the external input at time $t$, and
$s$ is a function which determines the form of short-term memory.

The  simplest  form  of  the  short-term  memory  structure  is  the  tapped  delay  line 
memory,  which  is  implemented  by  a buffer  containing  the  n most  recent  inputs. 
This  kind  of  memory  is  very  common  in  neural  net  models.  In  the  time  delay  neural 
network (TDNN) [Waibel et al., 1989; Lang et al., 1990], each TDNN unit can include
a number  of  delays.  This  enables  the  unit  to  relate  and  compare  the  current  input 
to  the  history  of  events. 
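
A tapped delay line is just a fixed-length buffer. The sketch below is a minimal Python illustration, not code from any TDNN implementation; the class name and the zero initialization are assumptions.

```python
from collections import deque

class TappedDelayLine:
    """Buffer holding the n most recent inputs, newest first."""

    def __init__(self, n):
        # Zero-filled at start (an assumption); maxlen drops the oldest sample.
        self.taps = deque([0.0] * n, maxlen=n)

    def step(self, value):
        self.taps.appendleft(value)
        return list(self.taps)
```

Feeding the sequence 1, 2, 3, 4 into a three-tap line leaves the taps holding 4, 3, 2: every sample inside the window is kept exactly, and everything older is lost at once.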

Another memory form is called the exponential trace memory, which involves some
recurrent connections [Mozer, 1993]. As the name suggests, in this kind of memory
the strength of an input decays exponentially with time. More recent inputs are given
greater strength than more distant inputs. In contrast, the tapped delay line
memory gives equal weights to the same input at different time slots, but it sharply
drops off at the boundaries of the time window. The following are four neural net
models with the exponential trace memory.

In Jordan [1986], state units are added to the input layer of a multilayer perceptron.
Then, feedback connections from the state units to themselves and from the
output units to the state units are implemented to allow the current state to depend
on the previous state and on the previous output. The weights of the self-connections
of state units are fixed to $a$ with $0 < a < 1$. The weights of the connections from
the output units to the state units are fixed to one.

In Elman [1990], context units are also placed at the input layer, but they have
feedback connections from only the hidden units. The weights of the feedback
connections are also fixed to $a$ in the region of (0, 1). The activations of the hidden units
represent the internal states. Thus, the context units allow the internal representation
of time.

Stornetta et al. [1988] proposed self-connections of the input units. The weights
of the connections are also fixed to $a$ in (0, 1). The input is in fact preprocessed
by this feedback structure before it reaches the rest of the network. The process is
equivalent to an infinite impulse response (IIR) digital filter with transfer function
$1/(1 - a z^{-1})$ [Hertz et al., 1991].
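
The transfer function $1/(1 - a z^{-1})$ corresponds to the difference equation $y(t) = a\,y(t-1) + x(t)$. The Python sketch below implements that recursion directly. It illustrates the filter only; the function name and the zero initial state are assumptions, and any input scaling used in the cited models is not reproduced.

```python
def exponential_trace(inputs, a):
    """First-order IIR filter y(t) = a * y(t-1) + x(t), with y(-1) = 0."""
    y = 0.0
    trace = []
    for x in inputs:
        y = a * y + x
        trace.append(y)
    return trace
```

For a unit impulse and `a = 0.5` the output is 1, 0.5, 0.25, 0.125: the input never drops out of the memory entirely, but its contribution shrinks by the factor `a` at every step.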

In Mozer [1989], a context layer is inserted between the input layer and the hidden
layer of a multilayer perceptron. The units in the context layer have self-connections
with a weight $a$ which is now adjustable. A training algorithm based on
back-propagation can be derived without back-propagation “in time.”

The above-mentioned neural networks are so-called partially recurrent networks
[Hertz et al., 1991], which have carefully selected feedback connections. The
computational flexibility of this kind of memory is restricted in time. The possible instability
due to the recurrence is also a concern.

The  third  memory  form  is  the  gamma  memory  which  exploits  both  methods  for 
enlarging  the  short-term  memory  capability  (by  delays  and  by  positive  feedback). 
The  significance  of  this  kind  of  memory  is  the  adaptability  of  the  memory  depth 
and  resolution  [de  Vries  and  Principe,  1992].  The  tapped  delay  line  memory  is  low 
depth  but  high  resolution,  whereas  the  exponential  trace  memory  is  high  depth  but 
low  resolution.  The  gamma  memory  can  achieve  the  tradeoff  between  resolution  and 
depth  through  adaptation.  It  generalizes  across  the  tapped  delay  line  and  exponential 
trace  memory  [Mozer,  1993].  In  the  next  section,  we  will  describe  the  gamma  memory 
model  and  show  how  it  can  be  extended  to  incorporate  information  about  the  events 
following  a reference  point  in  time. 

4.3  The  Extended  Gamma  Model 

This section starts with a brief review of the discrete gamma memory structure.
Assume that the input $I(t)$ to the gamma memory structure is an $N$-dimensional
signal. The past of this signal is represented in the gamma memory structure by

$$x_{i,0}(t) = I_i(t)$$

$$x_{i,k}(t) = (1 - \mu_i)\, x_{i,k}(t-1) + \mu_i\, x_{i,k-1}(t-1) \tag{4.2}$$

Figure  4.1.  The  discrete  gamma  memory  structure 


where $i = 1, \ldots, N$, $k = 1, \ldots, K$, and $t = 0, 1, \ldots, t_f$. $K$ is the order of the memory
structure and the $\mu_i$'s are the memory parameters of the memory banks. Each memory
bank is a tapped delay line with locally recurrent connections. The weights of the
connections between taps and the recurrent connections are $\mu_i$ and $1 - \mu_i$, respectively,
for the $i$th memory bank. Figure 4.1 shows the discrete gamma memory structure
(only one memory bank is shown) [de Vries and Principe, 1992; Principe et al., 1994].
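
The recursion of Eq. 4.2 for a single memory bank can be sketched in Python as follows. This is a minimal illustration: the function name, the list-based layout, and the zero initial state before the first sample are assumptions.

```python
def gamma_memory(inputs, K, mu):
    """Tap activations x[t][k], k = 0..K, of one gamma memory bank (Eq. 4.2)."""
    prev = [0.0] * (K + 1)          # state at time t - 1, assumed empty at start
    history = []
    for value in inputs:
        cur = [value] + [0.0] * K   # tap 0 is the current input
        for k in range(1, K + 1):
            # local recurrence (1 - mu) plus feed from the previous tap (mu)
            cur[k] = (1.0 - mu) * prev[k] + mu * prev[k - 1]
        history.append(cur)
        prev = cur
    return history
```

With `mu = 1.0` the recurrent term vanishes and each tap simply holds the input from `k` steps earlier, so the bank behaves as an ordinary tapped delay line. With `mu` below 1, each tap holds a smeared mixture of older inputs.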

The memory depth ($D$), defined by the center of mass of the last memory tap of
each memory bank, is $K/\mu_i$. The resolution ($R$) is defined by $\mu_i$. Thus,

$$K = D \times R \tag{4.3}$$

For a $K$-th order gamma memory, there is a tradeoff between the memory depth and
resolution through adaptation of $\mu_i$. It can be shown that the gamma memory is
stable for $0 < \mu_i < 2$. When $0 < \mu_i < 1$, it implements a tapped “dispersive” delay
line [de Vries and Principe, 1992]. Thus, the memory can include the information
outside the window defined by the order of the memory structure.

The gamma model was originally designed for temporal processing. With its
recursive memory structure, the gamma model can stretch the memory outside the
window. But since it is a causal structure, only information about the events
preceding the current event is contained in the gamma memory. For the purpose of context
analysis, information about the events following the current event is also important.
It is desirable that the memory of the subsequent events also has adaptive memory
depth. Thus, an anticausal gamma memory structure is introduced to encode the
information about the subsequent events.

4.3.1  Anticausal  Memory 

The  anticausal  memory  is  implemented  by  running  the  data  backwards  through 
another  set  of  gamma  memory  banks.  Before  the  causal  processing  starts,  the  input 
sequence  is  fed  into  this  second  gamma  memory  structure  from  the  end, 

$$x_{i,0}(t) = I_i(t)$$

$$x_{i,l}(t) = (1 - \nu_i)\, x_{i,l}(t+1) + \nu_i\, x_{i,l-1}(t+1) \tag{4.4}$$

where $l = 1, \ldots, L$ and $t = t_f, t_f - 1, \ldots, 0$. $L$ is the order of the anticausal gamma
memory structure and the $\nu_i$'s are the memory parameters. So when the causal processing
starts, the extended gamma model already has the information about the events
following the first event. The memory depth of each anticausal memory bank is $L/\nu_i$.
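
Running the recursion of Eq. 4.4 from the end of the record toward its start can be sketched in Python as below. The function name and the assumption that the memory is empty beyond the last sample are illustrative choices.

```python
def anticausal_gamma_memory(inputs, L, nu):
    """Tap activations x[t][l], l = 0..L, computed backwards in time (Eq. 4.4)."""
    nxt = [0.0] * (L + 1)           # state at time t + 1, assumed empty past the end
    history = [None] * len(inputs)
    for t in range(len(inputs) - 1, -1, -1):
        cur = [inputs[t]] + [0.0] * L
        for l in range(1, L + 1):
            cur[l] = (1.0 - nu) * nxt[l] + nu * nxt[l - 1]
        history[t] = cur
        nxt = cur
    return history
```

With `nu = 1.0`, tap `l` at time `t` holds the input at time `t + l`, so the taps at the first time step already contain the samples that follow it.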

With the addition of the anticausal memory, the extended model has two sets
of memory banks: one handles the left context (the causal part); the other handles
the right context (the anticausal part). The window size of the extended model is
$K + L$. The memory depth for each input variable is the sum of the memory depth
of the causal memory bank and the memory depth of its anticausal counterpart, i.e.,

$$K/\mu_i + L/\nu_i$$

The  memory  depths  of  the  causal  memory  and  the  anticausal  memory  need  not  be 
the  same  since  the  temporal  context  may  be  asymmetric.  However,  since  the  memory 
depths  are  controlled  by  the  adaptive  memory  parameters,  the  context  window  can 
be  symmetric,  i.e.,  centered  at  the  current  point  in  time.  Different  memory  depths 
needed  for  the  causal  part  and  the  anticausal  part  can  be  achieved  by  adjusting  the 
memory  parameters.  So  the  order  of  the  causal  memory  (K)  and  the  order  of  the 
anticausal  memory  ( L ) are  set  to  the  same  number  unless  it  is  known  a priori  that 
more  left  context  or  right  context  is  needed. 

4.3.2  The  Focused  Architecture 

The gamma memory structure can be placed in any portion of a neural network.
One special topological example is the focused architecture [Mozer, 1989]. As defined
by Mozer [1989], the focused architecture has all the memory elements in the input
layer, and the recurrency of these elements is restricted to one-to-one connections.
Also, integration over time in the memory structure is linear. This “focused”
architecture has advantages over fully recurrent networks. First, the error estimates in
the memory structure do not disperse during training. Secondly, back-propagation in
time is unnecessary for weight adaptation. The focused gamma net was formulated
and tested in de Vries and Principe [1992] and Principe et al. [1992]. The focused
gamma net is equivalent to the TDNN when the memory parameters $\mu_i$ are fixed to
one. It reduces to the same structure as in Stornetta et al. [1988] when the memory
order $K$ is set to one.

In  the  extended  gamma  model,  the  anticausal  memory  is  implemented  by  feeding 
the  data  sequence  in  reverse  order  to  the  memory  structure.  This  process  is  too 
complicated  to  be  implemented  in  the  hidden  layer  and  the  output  layer.  But  it  is 
straightforward  to  put  the  memory  structure  in  the  input  layer.  In  addition,  the 
purpose  of  the  extended  model  is  to  encode  the  context  of  the  input  sequence.  Thus, 
the  focused  architecture  can  meet  the  need. 

The focused gamma net with the extended memory structure has two major
components: (1) a short-term memory structure that saves the relevant events and
(2) an associator that maps the short-term memory to the desired outputs. The
short-term memory structure is divided into the causal part and the anticausal part. The
associator is a strictly feedforward structure. Figure 4.2 demonstrates such a network.
In the figure, $G_i$ represents a gamma memory bank. The activations of all the taps
in each memory bank are sent to the associator through weighted connections.
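
To make the data flow concrete, the Python sketch below assembles the vector that the associator would receive for one input variable at time `t`. It is a hedged illustration: `run_bank` and `associator_input` are hypothetical names, the anticausal bank is obtained by reversing the sequence, and dropping the duplicate tap 0 of the anticausal bank is a choice made here, not something the text specifies.

```python
def run_bank(sequence, order, param):
    """Gamma recursion over a sequence; returns the taps at every time step."""
    prev = [0.0] * (order + 1)
    taps = []
    for value in sequence:
        cur = [value] + [0.0] * order
        for k in range(1, order + 1):
            cur[k] = (1.0 - param) * prev[k] + param * prev[k - 1]
        taps.append(cur)
        prev = cur
    return taps

def associator_input(sequence, K, L, mu, nu, t):
    """Concatenate causal taps 0..K and anticausal taps 1..L at time t."""
    causal = run_bank(sequence, K, mu)
    # Feed the data in reverse order, then restore the original time order.
    anticausal = run_bank(sequence[::-1], L, nu)[::-1]
    return causal[t] + anticausal[t][1:]
```

With both parameters set to 1, the vector at time `t` is the current sample, its `K` predecessors, and its `L` successors, that is, a plain window centered on `t`.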

4.3.3  On-Line  Learning  and  Processing 

The anticausal memory of the extended gamma model is implemented by running
the data sequence backward. As described so far, the learning and processing of a data
sequence must be done off-line, i.e., after the whole data sequence is collected. This
is somewhat impractical. Thus, an algorithm for on-line learning and processing is
developed.

Figure 4.2. The focused gamma net with the extended memory structure


Recall that the memory depth ($D$) of an anticausal gamma memory bank is $L/\nu$,
where $L$ is the order of the memory bank, and $\nu$ is the memory parameter which also
represents the memory resolution. The memory depth gives us an idea of how far the
data information is retained. The information about the data outside the range of
the memory depth is very small and thus can be ignored. So the process only needs
to look ceil($D$) time slices ahead. Here, ceil() is a function which finds the smallest
integer greater than or equal to its argument.

The memory parameter is adaptive during the learning procedure, and the memory
depth changes with it. To simplify the loading of anticausal memory, we can set
the maximum depth ($D_{max}$), needed for the right context for a particular problem,
to a fixed number. The anticausal memory can be loaded with a delay of only $D_{max}$
time slices. With a larger $D_{max}$, more computation is needed to load the anticausal
memory. When $D_{max}$ is too small, the subsequent context information might not
be enough. Practically, we may choose $D_{max}$ to be a bit larger than expected.
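
The look-ahead idea can be sketched in Python as follows. The function names are hypothetical, and treating the memory as empty beyond the look-ahead horizon is the approximation described in the text, written out here under that assumption.

```python
import math

def lookahead(L, nu):
    """Number of future time slices to inspect: ceil(D) with D = L / nu."""
    return math.ceil(L / nu)

def anticausal_taps_online(sequence, t_c, L, nu, d_max):
    """Anticausal taps at the current time t_c using at most d_max future samples."""
    end = min(t_c + d_max, len(sequence) - 1)
    nxt = [0.0] * (L + 1)           # memory treated as empty beyond t_c + d_max
    for t in range(end, t_c - 1, -1):
        cur = [sequence[t]] + [0.0] * L
        for l in range(1, L + 1):
            cur[l] = (1.0 - nu) * nxt[l] + nu * nxt[l - 1]
        nxt = cur
    return nxt
```

For example, an order-2 bank with `nu = 0.5` has depth 4, so four future slices are inspected. The cost of loading the memory at each step grows linearly with `d_max`.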

Since the associator of the focused gamma net is strictly feedforward, standard
back-propagation can be used to train the weights. As for the adaptation of the
memory parameters, either back-propagation through time (BPTT) [Werbos, 1990]
or real-time recurrent learning (RTRL) [Williams and Zipser, 1989] can be applied.
Since the number of memory parameters is small in most cases, the RTRL would
be efficient here. The adaptation of causal memory parameters ($\mu_i$) of the focused
gamma net was derived in de Vries and Principe [1992]. The error gradients can be
calculated by the chain rule

$$\frac{\partial E}{\partial \mu_i} = -\sum_j \sum_k \left( \delta_j\, w_{i,k,j}\, p_{i,k} \right) \tag{4.5}$$

where $j$ indexes the hidden units, $\delta_j$ is the error term defined in Eq. 3.5, and $p_{i,k} =
\partial x_{i,k} / \partial \mu_i$. From differentiating Eq. 4.2, $p_{i,k}$ can be computed.

$$p_{i,0}(t) = 0$$

$$p_{i,k}(t) = (1 - \mu_i)\, p_{i,k}(t-1) + \mu_i\, p_{i,k-1}(t-1) + \left( x_{i,k-1}(t-1) - x_{i,k}(t-1) \right) \tag{4.6}$$

Here, $p_{i,k}(0)$ is defined as 0.

The adaptation of anticausal memory parameters ($\nu_i$) can be derived in the same
way.

$$\frac{\partial E}{\partial \nu_i} = -\sum_j \sum_l \left( \delta_j\, w_{i,l,j}\, q_{i,l} \right) \tag{4.7}$$

From differentiating Eq. 4.4, $q_{i,l} = \partial x_{i,l} / \partial \nu_i$ is computed.

$$q_{i,0}(t) = 0$$

$$q_{i,l}(t) = (1 - \nu_i)\, q_{i,l}(t+1) + \nu_i\, q_{i,l-1}(t+1) + \left( x_{i,l-1}(t+1) - x_{i,l}(t+1) \right) \tag{4.8}$$

Here, $q_{i,l}(t_c + D_{max})$ is defined as 0, where $t_c$ indicates the current time slice.
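
The sensitivity recursion can be checked numerically. The Python sketch below propagates the taps of Eq. 4.2 together with their derivatives with respect to the memory parameter, obtained by differentiating Eq. 4.2 term by term (the term in `p_prev[k - 1]` vanishes for the first tap because tap 0 does not depend on the parameter). The function name and the zero initial conditions are assumptions; the anticausal case mirrors this with time reversed.

```python
def gamma_taps_and_sensitivities(inputs, K, mu):
    """Return the final taps x_k and sensitivities p_k = dx_k/dmu for one bank."""
    x_prev = [0.0] * (K + 1)
    p_prev = [0.0] * (K + 1)
    for value in inputs:
        x = [value] + [0.0] * K
        p = [0.0] * (K + 1)         # p_0 = 0: the input does not depend on mu
        for k in range(1, K + 1):
            x[k] = (1.0 - mu) * x_prev[k] + mu * x_prev[k - 1]
            p[k] = ((1.0 - mu) * p_prev[k] + mu * p_prev[k - 1]
                    + (x_prev[k - 1] - x_prev[k]))
        x_prev, p_prev = x, p
    return x_prev, p_prev
```

Comparing `p` against a central finite difference of the taps is a quick way to confirm that the recursion matches the memory equation before it is used inside a training loop.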

Context analysis by the extended gamma model can be done in an on-line manner
with a delay of $D_{max}$ time steps. The RTRL can be efficiently applied to training the
memory parameters. Recall that the window size of the extended model is $K + L$,
and the memory depth for each input variable is $K/\mu_i + L/\nu_i$. Since we want the
memory depth to be stretched outside both ends of the window, both $\mu_i$ and $\nu_i$ should
be less than 1. In addition, to keep the system stable, $\mu_i$ and $\nu_i$ need to be in the
region of (0, 2). Therefore, all the memory parameters are bounded between 0 and 1
when the extended gamma model is used for context analysis.
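
One simple way to respect these bounds during training is to clip each parameter after its gradient step. The text does not say how the bounds are enforced, so the clipping rule, the function name, and the margins 0.01 and 0.99 in the Python sketch below are assumptions.

```python
def update_memory_parameter(value, gradient, rate, low=0.01, high=0.99):
    """Gradient-descent step on a memory parameter, clipped to stay inside (0, 1)."""
    return min(high, max(low, value - rate * gradient))
```

The small margins keep the parameter strictly inside the open interval, so the memory depth (order divided by the parameter) stays finite.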

4.4  Discussion 

The time delay neural network (TDNN) [Waibel et al., 1989; Lang et al., 1990]
can also be used to learn information about adjacent events by simply setting the
reference point at the center of the window. But its memory depth is fixed to the
window size (the order of the memory). Therefore, to change the context window, a
topological modification is needed, and the network has to be retrained each time.
When a broader range of context is needed, the order of the memory becomes larger,
and the number of weights would increase dramatically, which is undesirable. To
avoid trying different sizes of context windows, a priori knowledge about the window
size is required when using the TDNN for context analysis.

The  partially  recurrent  networks  reviewed  in  Section  4.2  can  also  learn  the  context 
of  a sequence,  but  only  the  left  context.  The  ability  to  process  context  in  these 
networks [Jordan, 1986; Stornetta et al., 1988; Mozer, 1989; Elman, 1990], which
have  a short-term  memory  structure  by  positive  feedback,  is  limited  to  an  exponential 
decay  over  time.  The  gamma  model  provides  a more  flexible  memory  structure. 

Until now, only context in the temporal domain has been considered and discussed. How
can we apply the extended gamma model to the analysis of context in the spatial
domain? Consider an $M \times N$ image in Figure 4.3. A square context window ($2K \times 2K$)
can be created to encode the context near a reference point. For a
tapped-delay-line-like structure, a buffer can be implemented to hold the $4K^2$ points in the context
window. But the context is confined in the window as with the TDNN.

With the gamma memory structure, the context encoding is much more flexible.
First, the expected maximum memory depth ($D_{max}$) needs to be specified as for
on-line processing in the temporal domain. A set of $4 D_{max}$ gamma memory banks
with order $K$ is used to encode row context from the data. Then, another set of $4K$
memory banks with order $K$ is used to encode column context from the tap outputs of
the first set of memory banks. This is illustrated in Figure 4.3. A buffer with size
$4K(D_{max} + K)$ is needed to store all temporary encoded values. But only $4K^2$ of the
temporary values are fed into the associator. Context encompassing a $D_{max} \times D_{max}$
area is encoded in $4K^2$ points.

Figure 4.3. A 2-D gamma context window
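
The buffer size follows from counting taps: the first set contributes one value per tap per bank, and so does the second. The short Python sketch below works through that arithmetic, using the bank counts as read from the text; the function name is illustrative.

```python
def buffer_sizes(K, d_max):
    """Counts of temporary values for the 2-D gamma context window."""
    row_values = 4 * d_max * K      # 4 * d_max banks of order K (row context)
    column_values = 4 * K * K       # 4 * K banks of order K (column context)
    return row_values, column_values, row_values + column_values
```

For `K = 3` and `d_max = 10` this gives 120 row values and 36 column values, 156 in total, which equals `4 * K * (d_max + K)`. Only the 36 column values reach the associator.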

During  the  training  phase,  the  context  window  is  moved  from  the  bottom-left 
corner  to  the  upper-right  corner  to  cover  the  whole  image.  The  error  gradients  of 
the  memory  parameters  in  the  second  set  of  memory  banks  can  be  estimated  by 
Eq.  4.7  and  Eq.  4.8.  The  only  difference  is  that  the  adjacency  is  now  in  space  instead 
of  in  time.  As  for  the  first  set  of  memory  banks,  the  BPTT  or  the  RTRL  can  be 
used  to  derive  equations  for  estimates  of  the  error  gradients  with  respect  to  each 
memory  parameter.  So  the  gamma  model  can  be  used  to  learn  spatial  context  of  a 
two-dimensional  image.  In  a similar  manner,  the  gamma  model  can  be  applied  to 
context  analysis  of  a three-dimensional  image. 




4.5  Summary 

Context is an essential aspect of human knowledge. Yet it is hard to completely
formulate this kind of knowledge in rules. Thus, special neural models are used to
learn the context from the data. The gamma model has a flexible and adaptive memory
form which is suitable for encoding context information. But since the gamma
memory is a causal system, only preceding context is included in the memory. Thus,
an anticausal part is added to the original gamma model to include following
context. The extended gamma model has adaptive memory depths for both prior and
subsequent data in order to learn both the left and right contexts.

In this chapter, the extended gamma model for context analysis is described. The
on-line processing scheme and the learning algorithm are delineated. The model is
compared with other neural models, and its extension to context analysis in space is
discussed.


CHAPTER  5 
EEG  SLEEP  STAGING 

5.1  Introduction 

Sleep has been categorized into six stages: stage 0 (awake), 1, 2, 3, 4, and REM
(rapid eye movement). Sleep staging is traditionally done by human experts reading
electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG)
recordings. Sleep EEG/EOG/EMG recordings are divided into 30- or 60-second
epochs. Each epoch is classified into one of the six stages. It is inevitably a
time-consuming task since the paper tracing of an eight-hour EEG recording is over 400
meters long [Smith, 1986]. Furthermore, the human experts are sometimes inconsistent
and inaccurate in scoring sleep stages. For example, it is reported in Kubicki
et al. [1989] that an expert scoring the same recording twice can differ by 5.7%.
The overall agreement of sleep scoring between human experts is about 90%
when the experts are from the same laboratory and could be lower than 80% if they
are from different laboratories. The computer can provide a much more objective
and accurate description of sleep stages than human experts. Therefore, automated
sleep staging is desired.

Since the 1960s, a variety of methods have been tried to score sleep stages
automatically. In recent years, commercial systems for automated sleep staging have been
introduced. But no system can consistently score sleep stages with over 90% accuracy.
In addition, sleep staging is still an important and challenging engineering problem
because it addresses codification of human knowledge. Thus sleep staging is chosen
as the application domain of this research.

Principe and Smith [1986] developed an EEG/EOG/EMG waveform detection
system. The system can summarize numbers or duration of relevant waveforms in
tokens. These sleep tokens are then used for automated sleep staging. Several
systems have been designed to process these tokens [Chang, 1987; Principe et al., 1989;
Principe and Tome, 1989]. The overall agreement between the systems and human
experts was about 85% [Principe et al., 1993]. This result is below the 90% agreement
level required between human experts in most laboratories. Thus further
improvement is needed.

A neural  network-based  hybrid  system  has  been  developed  for  sleep  staging.  The 
information  processor  of  the  system  is  a rule-based  neural  network  augmented  with 
the  fuzzification  layer  and  the  extended  gamma  model. 

In  this  chapter,  definitions  and  some  clinical  criteria  of  sleep  stages  are  described. 
Automated  methods  for  sleep  staging  are  reviewed.  Neural  network  approaches  are 
discussed.  And  the  neural  network-based  hybrid  system  for  sleep  staging  is  presented. 

5.2  Definitions  and  Clinical  Criteria 

A set of definitions of sleep stages was published in Rechtschaffen and Kales
(R+K) [1968]. The definitions have since become the international standard. Sleep
stages are described by the existence of certain EEG/EOG/EMG activities. Thus
EEG/EOG/EMG activities relevant to sleep staging are introduced before the
definitions of sleep stages.

Table 5.1. Criteria for detection of EEG activities

Activities       Frequency Window (Hz)    Amplitude Threshold (μV)
Alpha spindle    7.5 - 12.0               7.0
Beta spindle     16.0 - 32.0              3.0
Theta spindle    2.0 - 7.5                10.0
Sigma spindle    12.0 - 16.0              5.0
Delta            0.5 - 2.0                16.7
K-complexes      0.5 - 2.0                70.0
REM              0.5 - 3.0                30.0
SEM              0.2 - 5.0                15.0
Muscle           above 32                 5.0

The EEG recording can be analyzed by its frequency and amplitude. Generally
speaking, when a person is more relaxed, especially during sleep, his/her EEG activity
has higher amplitude and lower frequency. This is known as high-voltage slow activity.
When a person is wide awake, especially in an excited state, the EEG activity has
lower amplitude and higher frequency. This is known as low-voltage fast activity [Davies
and Horne, 1975]. EEG activities relevant to sleep staging are listed in Table 5.1 by
their respective frequency ranges and amplitude thresholds.
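
The criteria of Table 5.1 can be held in a small lookup structure. The Python sketch below copies the table values and returns every activity whose frequency window and amplitude threshold a measurement satisfies. It is an illustration only, not the waveform detection system cited in this chapter; the names, the half-open frequency intervals, and the lack of any distinction between EEG, EOG, and EMG channels are assumptions.

```python
import math

# name: (low Hz, high Hz, amplitude threshold in microvolts), from Table 5.1
CRITERIA = {
    "Alpha spindle": (7.5, 12.0, 7.0),
    "Beta spindle": (16.0, 32.0, 3.0),
    "Theta spindle": (2.0, 7.5, 10.0),
    "Sigma spindle": (12.0, 16.0, 5.0),
    "Delta": (0.5, 2.0, 16.7),
    "K-complexes": (0.5, 2.0, 70.0),
    "REM": (0.5, 3.0, 30.0),
    "SEM": (0.2, 5.0, 15.0),
    "Muscle": (32.0, math.inf, 5.0),
}

def candidate_activities(frequency_hz, amplitude_uv):
    """Names of all activities whose window and threshold are both satisfied."""
    return [name for name, (low, high, threshold) in CRITERIA.items()
            if low <= frequency_hz < high and amplitude_uv >= threshold]
```

Several windows overlap (for example Delta and K-complexes), so one measurement can match more than one activity. Frequency and amplitude alone do not separate them.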

The  definitions  of  sleep  stages  according  to  Rechtschaffen  and  Kales  [1968]  are 
summarized  as  follows: 


(a)  Stage  0 (awake): 

This stage corresponds to the waking state. It is characterized by alpha activity
with some beta activity. This stage is usually accompanied by a high tonic
EMG. Rapid eye movements (REMs) and eye blinks are evident.

(b) Stage 1:

This stage is defined by a relatively low amplitude mixed frequency EEG activity
with a prominence of theta activity. Vertex sharp waves may appear towards
the end of the stage. The EOG shows slow eye movements (SEMs), and there
is a fairly high tonic EMG which is generally below that in Stage 0. Stage 1
occurs most often in the transition from wakefulness to other sleep stages and
tends to be relatively short. This stage occupies approximately 5 to 10% of the
total sleep time of the young adult.

(c)  Stage  2: 

This  stage  is  defined  by  the  presence  of  sleep  (sigma)  spindles  and  K-complexes. 
There  is  a quantitative  absence  of  delta  activity  to  preclude  it  from  Stage  3. 
The  EOG  is  placid  and  the  EMG  is  lower  than  that  in  Stage  1.  This  stage 
occupies  40  to  50%  of  the  total  sleep  time. 

(d)  Stage  3: 

In  this  stage,  delta  activity  appears  for  at  least  20%,  but  not  more  than  50%, 
of  the  time.  There  may  also  be  K-complexes  and  sleep  spindles  present.  EMG 
and  EOG  are  of  a similar  level  to  that  in  Stage  2.  Stage  3 is  considered  to  be  a 
transition  between  Stages  2 and  4.  It  occupies  5 to  10%  of  the  total  sleep  time. 

(e)  Stage  4: 

In this stage, delta activity appears for a minimum of 50% of the time and
frequently the EEG record is completely dominated by this activity. Sleep
spindles may be present. The EOG is placid and the EMG tends to be low,
but still higher than that in Stage REM. This stage occupies 15 to 20% of the
total sleep time.

(f)  Stage  REM: 

This stage is characterized by the presence of low-voltage fast activity of mixed
frequency and episodic REMs. The EMG reaches the lowest level found in any
stage. Stage REM is of particular interest to sleep researchers because it is the
state during which most dreaming occurs. This stage occupies 20 to 25% of the
total sleep time. (For simplicity, we sometimes call this stage Stage 5 in this
dissertation.)

Sleep  EEG/EOG/EMG  recordings  are  divided  into  30-  or  60-second  epochs  for 
scoring  purposes.  Each  epoch  is  classified  into  one  of  the  six  stages.  Stages  1 to  4 
are  generally  continuous.  But  Stage  REM  could  happen  quite  suddenly  at  any  time 
during  sleep. 

Delta sleep (Stages 3 and 4) and REM sleep (Stage REM) are distributed in a
different manner within nocturnal sleep. Delta sleep occurs mainly in the first half of
the sleep. On the other hand, REM sleep appears throughout the night. The duration
of REM sleep becomes longer as the sleep progresses. As a result,
the majority of REM sleep occurs in the second half of the sleep [Davies and Horne,
1975]. Examples of sleep stage profiles in the Appendix support this claim.

Next, a set of clinical criteria for sleep staging is presented [Williams et al., 1974].
The criteria, which were used in the Florida laboratory, were derived through a gradual
evolution from the Dement-Kleitman method [Dement and Kleitman, 1957]. They
can be applied to EEG recordings divided into one-minute epochs.

(a) Stage 0: At least 30 seconds of alpha activity with a minimum peak amplitude
of 40 μV.

(b)  Stage  1:  Less  than  30  seconds  of  alpha  activity,  and  no  more  than  one  well- 
defined  sigma  spindle  or  K-complex;  if  a subject  does  not  display  clear  alpha 
activity,  muscle  artifact  and  eye  movement  tracings  are  used. 

(c) Stage 2: At least two well-defined spindles, or at least two K-complexes, or
one of each; no more than 12 seconds of delta activity with a minimum peak
amplitude of 40 μV.

(d)  Stage  3:  At  least  13  seconds,  but  less  than  30  seconds,  of  delta  activity. 

(e)  Stage  4:  At  least  30  dominant  seconds  of  delta  activity. 

(f)  Stage  REM:  Stage  1 with  REM  episodes. 
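
As an illustration only, criteria (a) to (f) can be written as a small decision function over the events counted in a one-minute epoch. This is a sketch, not the laboratory's scoring procedure: the function name, the argument names, and the order in which the criteria are tested are assumptions (the criteria do not state a precedence), the durations are assumed to count only activity that already meets the amplitude requirement, and the fallback for subjects without clear alpha activity is not modeled.

```python
def score_epoch(alpha_s, delta_s, spindles, k_complexes, rem_episodes):
    """Score a one-minute epoch from per-minute event summaries.

    alpha_s, delta_s: seconds of alpha / delta activity (hypothetical inputs).
    spindles, k_complexes, rem_episodes: event counts in the epoch.
    """
    # Assumption: delta criteria are tested first; the source gives no precedence.
    if delta_s >= 30:                      # (e) at least 30 s of delta
        return "4"
    if delta_s >= 13:                      # (d) 13 s to less than 30 s of delta
        return "3"
    if alpha_s >= 30:                      # (a) at least 30 s of alpha
        return "0"
    if spindles + k_complexes >= 2:        # (c) two spindles, two K-complexes, or one of each
        return "2"
    # (b) less than 30 s of alpha and at most one spindle or K-complex;
    # (f) Stage REM is Stage 1 with REM episodes.
    return "REM" if rem_episodes > 0 else "1"
```

For example, `score_epoch(5, 5, 1, 1, 0)` returns `"2"`, because one spindle and one K-complex are present and delta activity stays below 13 seconds.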

The criteria are based on the Dement-Kleitman method with modifications
according to clinical experience. They have been shown to yield a 90+% level of
intra- and inter-scorer reliability [Williams et al., 1974]. This set of criteria is used
to formulate expert rules because of its clarity.

5.3  A Review  of  Automated  Methods 

Amplitude analysis and spectral analysis of the EEG were each tried for
automated sleep staging in the late 1960s. The researchers concluded that neither of
them alone suffices for accurate sleep staging. EOG and EMG data were added
in the 1970s to help improve the accuracy [Smith, 1986]. It was reported that
most systems had an overall man-machine agreement of 82 to 84% in the early 1970s.
But the results were obtained from analyzing the sleep of only young adult subjects.
Also, different parameter settings were needed for different subjects [Smith et al.,
1978].

Smith  et  al.  [1978]  were  the  first  to  develop  a system  which  can  summarize  minute- 
by-minute  information  of  EEG  waveforms.  Different  parameter  settings  for  different 
subjects  were  not  necessary  for  the  system.  A subject  population  with  a large  age 
variation  (5  - 79  years  old)  was  used  to  test  the  system.  The  overall  performance 
of  the  system  was  about  80%.  Another  similar  system  by  Hasan  [1983]  achieved 
80.5%  man-machine  agreement  for  a group  of  young  adults  and  78.3%  for  three  older 
groups.  The  waveform  detectors  of  these  two  systems  gave  over  90%  accuracy.  But 
the  sleep  staging  systems  only  gave  approximately  80%  accuracy. 

In recent years, commercial systems for automated sleep staging have been
introduced, but no system consistently scores sleep stages with over 90% accuracy.
A comparison of sleep scoring by human experts and a commercial system (the Oxford
Sleep Stager) can be found in Kubicki et al. [1989]. The performance of the system
was not as good as expected. It was reported that the agreement between the sleep
stager and the consensus of two experts was only 73.1%.

The  main  difficulty  in  developing  a robust  sleep  staging  system  is  the  intra-  and 
inter-subject  variability  of  sleep.  The  inconsistency  of  human  experts  also  contributes 
to  the  disagreement  between  human  and  machine  scorings.  Furthermore,  because
experts  from  different  laboratories  apply  standard  rules  in  different  ways  and  the 
rules  might  not  cover  some  sleep  patterns  in  some  sleep  recordings,  the  variability 
of  intergroup  scoring  might  be  large.  In  Ferri  et  al.  [1989],  sleep  experts  from  nine 
different  laboratories  were  asked  to  score  five  sets  of  sleep  recordings.  A consensus 
scoring  was  obtained  for  each  epoch  by  finding  the  most  scored  stage.  An  epoch  was 
classified  as  “dubious”  if  such  a scoring  could  not  be  determined.  The  intragroup 
scorings  seemed  to  be  good  and  reliable.  But  intergroup  scorings  were  very  different 
in  some  cases.  The  results  showed  that  the  agreement  between  the  scorings  of  the 
groups  and  the  consensus  scoring  ranged  from  60.9%  to  96%. 
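
The consensus step described above can be sketched as a per-epoch majority vote. This is only an illustration: the function name is hypothetical, and treating a tie for the most frequently scored stage as the "dubious" case is an assumption, since the text says only that an epoch is dubious when a most scored stage cannot be determined.

```python
from collections import Counter

def consensus(scores):
    """Return the most frequently scored stage for one epoch.

    scores: the stage labels given by the individual groups.
    Assumption: a tie for first place makes the epoch "dubious".
    """
    ranked = Counter(scores).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "dubious"
    return ranked[0][0]
```

With nine groups, `consensus(["2", "2", "3", "2", "1", "2", "2", "3", "2"])` returns `"2"`, while `consensus(["1", "1", "2", "2"])` returns `"dubious"`.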

An  EEG/EOG/EMG  waveform  detection  system  was  developed  by  Principe  and 
Smith  [1986].  The  number  and  duration  of  the  EEG/EOG/EMG  waveforms  relevant 
to  sleep  staging  during  each  minute  can  be  summarized  in  a token  by  the  system. 
It  is  assumed  that  the  minute  tokens  contain  all  the  information  needed  for  sleep 
staging  by  computers.  In  previous  work,  different  techniques  were  tried  to  further 
process  the  tokens  for  the  purpose  of  sleep  staging.  For  example,  an  expert  system 
was  developed  in  Chang  [1987];  a belief  automaton  was  utilized  in  Principe  et  al. 
[1989];  and  single  and  multi-layer  perceptrons  were  used  in  Principe  and  Tome  [1989]. 
The  aforementioned  three  information  processing  models  have  been  analyzed  and 
compared  in  Principe  et  al.  [1993].  The  overall  agreement  between  the  systems 
and  human  experts  was  about  85%.  Among  them,  the  expert  system  has  the  best 
performance,  but  it  is  a nontrivial  and  long  engineering  task  to  develop.  On  the  other 
hand,  the  neural  network  is  the  easiest  one  to  implement,  and  it  has  comparable
results  to  the  other  two. 

5.4  Neural  Network  Approaches 

Sleep staging can be divided into two major tasks. The first task is the
recognition of different EEG/EOG/EMG waveforms. The second one is the processing
of the waveform information to score sleep stages. Some researchers have used neural
networks for the detection of EEG waveforms [Jando et al., 1993; Principe and Zahalka,
1993]. This research concentrates on information processing of sleep EEG tokens,
obtained from the system by Principe and Smith [1986], using neural networks.

In Principe and Tome [1989], single and multi-layer perceptrons were tried
for this purpose. The EEG/EOG/EMG tokens were discretized into three or four
levels before being fed to the neural networks. The main issue in that work was the
selection of the training data. Twenty-four artificial patterns were found to be efficient
in training the networks to distinguish four sleep classes (0, 1/5, 2, and 3/4). Stages
3 and 4 were then differentiated by thresholding the delta activity. Because context
information is needed for determining Stage 5, human decisions were involved in the
final classification of Stages 1 and 5. The nets were trained only on the information
in each epoch without considering the context among adjacent epochs.

Principe and Tome [1989] used only the simplest neural network techniques. Even
though the results showed that the neural network is a promising approach to
automated sleep staging, the overall performance still needs considerable improvement.
Recent progress in neural network research can be exploited to enhance the
performance of sleep staging by neural networks, for example, the incorporation of
knowledge and fuzziness into the neural network structure, and the special designs of
neural models for context learning.

The  definitions  of  sleep  stages  are  derived  from  expert  knowledge  about  sleep.  We 
know  that  humans  are  not  always  precise  in  making  decisions.  In  the  R+K  definitions, 
most  rules  are  stated  in  linguistic  terms  rather  than  numerical  ones.  Therefore,  fuzzy 
sets  can  be  utilized  to  provide  a better  representation  of  the  variables.  To  incorporate 
the  fuzzification  process  into  the  neural  network  approach,  the  fuzzification  method 
depicted  in  Chapter  3 can  be  used. 

Since  standard  rules  of  EEG  sleep  staging  do  exist,  it  is  desirable  to  utilize  the 
existent  knowledge  about  sleep  staging,  and  to  take  advantage  of  the  learning  power 
of  neural  networks  at  the  same  time.  The  knowledge-based  neural  network  which 
incorporates  expert  knowledge  into  the  neural  network  framework  is  exactly  what  we 
need  here.  It  has  been  shown  that  knowledge-based  neural  networks  can  outperform 
multilayer  perceptrons  [Towell  et  al.,  1990].  The  KBCNN  method  introduced  in
Chapter  2 can  be  used  to  construct  the  initial  neural  network  from  rules  based  on 
the  clinical  criteria  in  Section  5.2. 

The above-mentioned neural networks are all static, that is, they have no short-term
memory capability. Thus, only the information within each epoch is exploited for
making decisions. The staging results obtained with only the information within each
minute are referred to as primary stages. When human experts score sleep stages, they
need to check waveform information of adjacent epochs for determining certain stages
(e.g., REM for Stage 5) and for reducing unnecessary sleep stage transitions. So a
procedure is needed to remedy the lack of context information in decision making. A
sliding window centered at the current epoch for contextual interpretation was used in
an expert system implementation to smooth the primary stages [Chang, 1987]. However,
the window size is limited, and the context rules are not complete. A precise
understanding of sleep context has not yet been achieved. The extended gamma
model developed for context analysis can be used to learn the context information of
sleep.

5.4.1  A Hybrid  System  for  Sleep  Staging 

When human experts score the sleep EEG data, they recognize the relevant events
first. Then, they make decisions based on the occurrence and duration of the events.
Sometimes, they have to check the temporal context to refine the decisions. A hybrid
system containing three subsystems is designed for automated sleep staging using
neural network techniques. Each subsystem corresponds to a part of the human
decision making.

There are three processing blocks in this system (see Figure 5.1). The first block
is comprised of the waveform detectors which recognize the related EEG/EOG/EMG
events and summarize them in minute tokens. Neural networks can be used for this
purpose. Currently, the system developed by Principe and Smith [1986] is used.
The second block is a static classifier which processes token information within each
minute and produces primary stages. The rule-based neural network (RBNN) with
an extra fuzzification layer is employed. In the third block, a context analyst is
needed to incorporate the context information. The extended gamma model is used
to analyze the temporal context and produce the final staging decision.

[Figure 5.1 blocks: EEG -> Preprocessor -> Static Classifier -> Context Analyst]

Figure 5.1. The block diagram of a hybrid system for automated sleep staging

5.5  Summary 

Sleep stages are defined by the occurrence and duration of EEG/EOG/EMG
activities. The standard definitions are consistent but incomplete because the
electrophysiological signals are unconstrained. Human experts score sleep stages
subjectively because of the incompleteness of the standard rules. Furthermore, humans
are imprecise and inconsistent. Therefore, an automated system is needed to provide a
standard and objective means of scoring sleep stages and to serve as a labor-saving tool.

Neural  networks  can  be  employed  to  learn  different  sleep  patterns  from  the  token 
data.  But  because  the  data  are  unconstrained,  a few  sets  of  all  night  sleep  data  might 
not  cover  the  pattern  space  well.  The  incompleteness  of  both  human  knowledge  and 
training  data  makes  this  problem  suitable  for  the  knowledge-based  neural  network. 
The  initial  knowledge-based  neural  network  is  built  based  on  the  rules  of  sleep  staging. 
Then  the  performance  of  this  neural  network  is  improved  by  adaptation  on  the  sleep 
data. The two aspects of human knowledge, fuzziness and context, which are not
considered in the framework of knowledge-based neural networks, can be accommodated
by adding a fuzzification layer and a context-processing neural model. A hybrid
system for sleep staging is hence constructed.


CHAPTER  6 
EXPERIMENTS 

6.1  Introduction 

The  main  focus  of  this  dissertation  is  on  the  integration  of  human  knowledge  and 
neural  networks.  The  knowledge-based  neural  network  deals  with  direct  translation 
of  human  knowledge  into  a neural  network  structure.  The  fuzzy  neural  network 
takes  the  fuzzy  nature  of  human  knowledge  into  consideration.  It  provides  a better 
representation of the input and output. The context-processing neural network
learns context information, which is also an aspect of human knowledge but is
hard to formulate explicitly.

EEG sleep staging involves the codification of human knowledge. Human
knowledge about sleep staging has been formulated as standard definitions through the
extensive experience of experts. It is desirable to tackle the sleep staging problem by
utilizing the existing knowledge, and at the same time taking advantage of the
learning power of neural networks. So the sleep staging problem is a good application
domain for this research. Several experiments are designed to evaluate the
aforementioned neural models. The overall performance of the neural network-based
hybrid system is also compared with previous results obtained by other approaches.

The first experiment compares the perceptron, the multilayer perceptron (MLP),
and the knowledge-based conceptual neural network (KBCNN) on the primary
staging of sleep using 24 artificial patterns as the training set. The second experiment
evaluates the effect of the fuzzification layer on the performance of the MLP and
the KBCNN. All-night token data are used as the training set. The “leave-one-out”
experimental methodology [Fukunaga, 1972] is used here as well as in the other
experiments to evaluate the generalization capability of the neural networks. This is
a strict way to test a neural network. Assume that there are N sets of data. Data
from N - 1 sets are used as the training set. The data set left out is then used as
the test set. The experiment is repeated N times, each time with a different data set
as the test set. The third experiment demonstrates the learning capability of the
extended gamma model on sleep context. It compares three different approaches. The
fourth experiment uses five selected sleep records as the training set, and tests the
trained neural networks on the rest of the sleep records. The fifth experiment divides
sleep records according to subject ages with the intention of reducing intersubject
variability. All the neural models were implemented as computer programs written in C
and run on a Sun workstation.
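
The leave-one-out procedure described above can be sketched as a generator of training/test splits. The function name is hypothetical, and the record numbers in the usage example are simply the five asterisked records of Table 6.1; the training and evaluation steps themselves are left out.

```python
def leave_one_out(records):
    """Yield (training_records, test_record) pairs, one pair per record.

    Each record serves as the test set exactly once; the remaining
    N - 1 records form the training set for that repetition.
    """
    for i, test_record in enumerate(records):
        training_records = records[:i] + records[i + 1:]
        yield training_records, test_record

# Usage: five records give five repetitions with four training records each.
selected = ["10109", "11623", "11717", "10067", "11740"]
splits = list(leave_one_out(selected))
```

The reported figure for a network would then be the average of the N test-set results, as is done for the tables in this chapter.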

In this chapter, the sleep records used in the experiments are first introduced and
briefly analyzed. Then the experiments are described, and the results are presented
and discussed. A general discussion is given at the end.

6.2 Sleep Records

Twelve sets of EEG/EOG/EMG tokens used in Chang [1987] are chosen as the
experimental data in this research. The basic information about these records is
summarized in Table 6.1, which includes the age of the subject, the total length of the
record in minutes, and the distribution of sleep stages in both minutes and percentages.
The sleep stage profiles of the sleep records are provided as figures in the Appendix.
The records can be divided into three age groups: Group I (13 - 18 years old), Group
II (25 - 34 years old), and Group III (43 - 70 years old). From the table and the
figures, we can see different sleep patterns for different age groups. For example,
older subjects tend to lack deep sleep (delta sleep), i.e., Stages 3 and 4, and have
more sleep interruptions (Stage 0) after the onset of sleep than the younger ones.

Only five of the sleep records are used in most experiments since these selected
records cover the full age range. These records are marked with an asterisk by their
record numbers in Table 6.1. They are the records used in Principe et al. [1993], which
reported experimental results of an expert system, a belief automaton, and neural
networks. Thus, the experimental results of this work can be directly compared with
the previous ones.

The  information  about  six  events  relevant  to  sleep  staging  is  used  as  the  input 
to  the  neural  network.  The  information  includes  the  number  of  events  detected  for 
sigma  spindles  and  REM  waves,  and  the  duration  of  alpha,  beta,  delta  waves,  and 
muscle  activities  in  each  minute.  The  desired  output  of  the  neural  network  is  the 
human  sleep  scoring  which  is  the  consensus  of  two  experts. 
Table 6.1. EEG/EOG/EMG recordings and their sleep stage distributions

Record   Age   Total   STG0    STG1    STG2    STG3    STG4    STG5
10109*   13    468     6       21      221     16      65      138
                       1.3%    4.5%    47.3%   3.4%    13.9%   29.6%
10158    13    524     54      13      274     27      66      90
                       10.3%   2.5%    52.3%   5.2%    12.6%   17.2%
10146    13    501     28      15      197     47      94      120
                       5.6%    3.0%    39.3%   9.4%    18.8%   24.0%
11623*   18    397     50      8       209     11      61      58
                       12.6%   2.0%    52.6%   2.8%    15.4%   14.6%
11769    25    471     13      11      237     33      31      156
                       2.8%    2.3%    50.3%   4.9%    6.6%    33.1%
11717*   27    421     13      8       219     20      43      118
                       3.1%    1.9%    52.0%   4.8%    10.2%   28.0%
11747    29    443     11      6       212     27      51      136
                       2.5%    1.4%    47.9%   6.1%    11.5%   30.7%
10244    34    424     20      17      242     35      0       110
                       4.7%    4.0%    57.1%   8.3%    0.0%    25.9%
10067*   43    449     75      62      177     17      0       118
                       16.7%   13.8%   39.4%   3.8%    0.0%    26.3%
11771    53    446     17      28      321     19      0       61
                       3.8%    6.3%    72.0%   4.3%    0.0%    13.7%
10889    53    435     23      19      265     0       0       128
                       5.3%    4.4%    60.9%   0.0%    0.0%    29.4%
11740*   70    435     104     29      233     0       0       69
                       23.9%   6.7%    53.6%   0.0%    0.0%    15.9%

The tokens of relevant events can be discretized according to Table 6.2, which is
obtained, with a minor change, from Principe and Tome [1989]. Levels 2 and 3 of
the delta wave are combined into one level (Level 2) when only four sleep classes (0,
1/5, 2, and 3/4) are the classification targets. Levels 0, 1, 2, and 3 correspond to
SCARCE, LOW, MEDIUM, and HIGH, respectively, in linguistic terms.

Table 6.2. Discretization of EEG/EOG/EMG tokens

Levels          0      1       2        3
Alpha (sec)     0-5    6-10    11-30    > 30
Beta (sec)      0-5    6-10    11-30    > 30
Delta (sec)     0-5    6-12    13-30*   > 30*
Sigma (#)       0      1       2-3      > 3
Muscle (sec)    0-5    6-10    11-20    > 20
REM (#)         0      1       2-3      > 3

Twenty-four artificial training patterns were designed by Principe and Tome [1989]
using human knowledge about sleep staging (see Table 6.3). In the table, the numbers
indicate the levels of the activities as shown in Table 6.2. The first four patterns of
each sleep class are distinct exemplars; the remaining two are random combinations of
possible cases. These artificial patterns have been shown to be efficient in training
neural networks to distinguish four sleep classes. Neural networks trained on the
artificial patterns generalize fairly well.
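
The discretization of Table 6.2 can be sketched as a lookup on the upper bound of each level. The dictionary, the function name, and the token layout (a mapping from event name to its per-minute value) are assumptions made for this illustration; values that fall between two tabulated integers are assigned to the higher level, which the table does not specify.

```python
from bisect import bisect_left

# Upper bounds of Levels 0, 1, and 2 for each event, taken from Table 6.2.
UPPER_BOUNDS = {
    "alpha":  (5, 10, 30),   # seconds
    "beta":   (5, 10, 30),   # seconds
    "delta":  (5, 12, 30),   # seconds
    "sigma":  (0, 1, 3),     # number of spindles
    "muscle": (5, 10, 20),   # seconds
    "rem":    (0, 1, 3),     # number of REM waves
}

def discretize(token, four_classes=False):
    """Map each event value in a minute token to a level from 0 to 3."""
    levels = {name: bisect_left(UPPER_BOUNDS[name], value)
              for name, value in token.items()}
    if four_classes:
        # Levels 2 and 3 of delta are merged when only four classes are targeted.
        levels["delta"] = min(levels["delta"], 2)
    return levels
```

For instance, a token with 12 seconds of delta activity maps to delta Level 1, and 13 seconds maps to Level 2.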

6.3  Experiment  1:  Primary  Staging  with  Artificial  Patterns 

The  first  task  is  to  learn  the  four  sleep  classes:  Stages  0,  1/5,  2,  and  3/4  using 
the  24  artificial  patterns.  The  perceptron,  the  multilayer  perceptron  (MLP),  and  the 
knowledge-based  conceptual  neural  network  (KBCNN)  are  used,  and  the  results  are 
compared. 

Table 6.3. Artificial training patterns for four sleep classes [Principe and Tome, 1989]

Class      Alpha   Beta   Delta   Sigma   Muscle   REM
STG 0      2       0      0       0       2        2
           2       2      0       0       3        3
           3       0      0       0       0        0
           2       0      0       0       0        0
           2       2      0       0       0        0
           3       1      1       1       1        2
STG 1/5    0       3      0       0       0        0
           0       2      0       0       0        0
           0       1      0       0       0        0
           0       0      0       0       0        0
           1       3      0       0       1        1
           1       2      1       1       1        1
STG 2      0       0      0       1       0        0
           0       0      0       2       0        0
           0       0      0       3       0        0
           0       0      1       2       1        1
           1       1      1       2       2        2
           0       2      0       3       1        1
STG 3/4    0       0      2       1       0        0
           0       0      2       1       0        0
           0       0      2       0       0        0
           0       0      2       3       2        2
           0       0      2       3       0        0
           1       1      2       3       1        1

A set of rules based on the clinical criteria in Section 5.2 for scoring four sleep
classes is shown in Table 6.4. The KBCNN transformed from the rules performs at
78.6% accuracy on the five selected records without any training when the CF-based
activation function is used. The accuracy is this high because the initial weights are
assigned to suit the certainty factor model [Fu, 1993]. When the rules are used as
the rule base of a simple expert system, and a simple pattern matcher is used as the
inference engine, the expert system performs at 74.5% accuracy on the five selected
records. This result is worse than that of the KBCNN because no certainty factors
are involved in the inference. When two or more conflicting rules fire for the same
pattern, the pattern is counted as misclassified. It will be shown later that training
the KBCNN improves the classification accuracy by refining the rules. The number
of free parameters is only 34 for the KBCNN. In order to learn the token patterns
which do not follow the rules, two extra hidden units with full connections to the
input units and the output units are added to the KBCNN.

Table 6.4. A set of rules for classification of four sleep classes

R1   Alpha3                      then Stage0
R2   Sigma0 and (not Alpha3)     then Stage1/5
R3   Sigma1 and (not Alpha3)     then Stage1/5
R4   Delta0 and Sigma2           then Stage2
R5   Delta1 and Sigma2           then Stage2
R6   Delta0 and Sigma3           then Stage2
R7   Delta1 and Sigma3           then Stage2
R8   Delta2                      then Stage3/4

In Table 6.5, the perceptron, the multilayer perceptron with a hidden layer of
10 hidden units, and the KBCNN with two extra hidden units are compared. Two
activation functions - the sigmoid function and the CF-based function - were tried
for the KBCNN. First, the 24 artificial patterns were used to train the nets. Then,
the training set was split into halves to see the influence of incompleteness of training
data on the performance of the networks. The first three patterns of each sleep class
were picked, resulting in a training set of 12 patterns. The trained neural networks
were tested with the five selected records, and the average percentages are presented.
The numbers of free parameters and hidden units of the networks are also compared
in the table.

Table 6.5. Comparison of neural networks on classification of four sleep classes

Cases                 Perceptron   MLP     KBCNN (sigmoid)   KBCNN (CF)
24 Patterns (%)       85.28        85.86   85.12             84.28
12 Patterns (%)       79.06        79.58   82.40             84.34
Free Parameters (#)   96           284     90                90
Hidden Units (#)      0            10      10                10
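
The rule-based baseline used in this experiment (the rules of Table 6.4 applied by a simple pattern matcher, with conflicting firings counted as misclassified) can be sketched as follows. This is an illustration, not the original program: the names are hypothetical, a pattern is assumed to be a mapping from event name to its discretized level, and delta is assumed to use the merged three-level coding of the four-class case.

```python
# Rules R1-R8 of Table 6.4 as (name, condition, stage) triples.
RULES = [
    ("R1", lambda p: p["alpha"] == 3, "0"),
    ("R2", lambda p: p["sigma"] == 0 and p["alpha"] != 3, "1/5"),
    ("R3", lambda p: p["sigma"] == 1 and p["alpha"] != 3, "1/5"),
    ("R4", lambda p: p["delta"] == 0 and p["sigma"] == 2, "2"),
    ("R5", lambda p: p["delta"] == 1 and p["sigma"] == 2, "2"),
    ("R6", lambda p: p["delta"] == 0 and p["sigma"] == 3, "2"),
    ("R7", lambda p: p["delta"] == 1 and p["sigma"] == 3, "2"),
    ("R8", lambda p: p["delta"] == 2, "3/4"),
]

def classify(pattern):
    """Return the stage concluded by the firing rules, or None on a conflict."""
    concluded = {stage for _, condition, stage in RULES if condition(pattern)}
    return concluded.pop() if len(concluded) == 1 else None

def accuracy(patterns, labels):
    """Percentage of patterns classified correctly; conflicts count as errors."""
    hits = sum(classify(p) == y for p, y in zip(patterns, labels))
    return 100.0 * hits / len(patterns)
```

Under these rules a pattern with Alpha3 and Delta2 fires both R1 and R8, so `classify` returns `None` and the pattern is counted as misclassified.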

For the classification of the four sleep classes, the KBCNN performs at the same
level as the single- and multi-layer perceptrons. With fewer connections, the KBCNN
has the advantages of faster training and processing over the fully connected MLP. The
KBCNN is also superior to the single- and multi-layer perceptrons under the
condition of incomplete training data. When only the 12 selected artificial training
patterns were used to train the nets, the KBCNN performed clearly better than the
single- and multi-layer perceptrons, especially when the CF-based function was used.
This is because the KBCNN starts with existing knowledge mapped from the rules.
Furthermore, the CF-based function tends to preserve the knowledge better than the
sigmoid function [Fu, 1993].

6.4  Experiment  2:  Primary  Staging  with  All-Night  Data 

In this experiment, all-night sleep tokens were used to train different neural
networks on classification of six sleep stages. The five selected records were also used
here, and the leave-one-out experimental methodology was utilized to rigorously
test the nets. The reason for using real token data instead of artificial patterns as the
training set is that real token data may provide the net with more information. The
MLP and the KBCNN are compared for learning the all-night data. The fuzzification
layer was added to the nets in the continuous case to demonstrate the effectiveness
of adaptive fuzzy membership functions.

Table 6.6. A set of rules for classification of six sleep stages

R1    Alpha3                      then Stage0
R2    Sigma0 and (not Alpha3)     then Stage1/5
R3    Sigma1 and (not Alpha3)     then Stage1/5
R4    Stage1/5 and REM0           then Stage1
R5    Stage1/5 and (not REM0)     then Stage5
R6    Delta0 and Sigma2           then Stage2
R7    Delta1 and Sigma2           then Stage2
R8    Delta0 and Sigma3           then Stage2
R9    Delta1 and Sigma3           then Stage2
R10   Delta2                      then Stage3
R11   Delta3                      then Stage4

The minute tokens of the six relevant events can be used directly as the input to
the neural networks. This is referred to as the continuous case. Alternatively, each
event can be discretized into four levels according to Table 6.2. This is called the
discrete case, which is composed of 24 input variables. The desired output of this
experiment was obtained by modifying the human scoring. All epochs of Stage REM
with no REM activity were changed to Stage 1 since the decision for such epochs
requires context information. This context information can be learned by the
context-processing neural model.
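
The two preparation steps above can be sketched in a few lines. Both functions are illustrative assumptions: the text gives only the count of 24 input variables, so the one-binary-input-per-level coding used here (six events times four levels) is a guess at the encoding, and the function and argument names are hypothetical.

```python
EVENTS = ("alpha", "beta", "delta", "sigma", "muscle", "rem")

def one_hot(levels):
    """Encode six event levels (0-3) as 24 binary inputs, four per event.

    Assumption: one binary input per (event, level) pair.
    """
    vector = []
    for name in EVENTS:
        vector.extend(1 if levels[name] == k else 0 for k in range(4))
    return vector

def primary_target(human_stage, rem_count):
    """Relabel Stage REM epochs that show no REM activity as Stage 1."""
    if human_stage == "REM" and rem_count == 0:
        return "1"
    return human_stage
```

The relabeled targets are what the static classifier is trained on; recovering the original REM labels for such epochs is left to the context-processing stage.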

The set of rules in Table 6.4 is augmented to the one in Table 6.6 for
classification of the six sleep stages. Context among sleep epochs is not considered in
these rules. The numbers associated with the events (0, 1, 2, and 3) correspond to
the four discretized levels in the discrete case, and to the fuzzy levels - SCARCE,
LOW, MEDIUM, and HIGH - in the continuous case. When the rules are used as
the rule base of a simple expert system, and a simple pattern matcher is used as the
inference engine, the expert system performs at only 56.15% accuracy on the five
selected records.

Table 6.7. Comparison of neural networks on classification of six sleep stages

Neural Net   Training Set (%)   Test Set (%)   Weights & Biases   Hidden Units   Fuzziness Units
Discrete Case
MLP1         84.14              80.14          626                20             0
KBCNN        82.70              79.34          104                14             0
Continuous Case
MLP1         84.88              78.36          331                25             0
MLP2         83.86              76.02          394                16 - 12        0
FMLP         86.18              83.08          508                10             36
FKBCNN       86.36              83.70          200                14             36

The rule set in Table 6.6 was used to construct a KBCNN for the discrete case,
and a fuzzy KBCNN, i.e., a KBCNN with a fuzzification layer, for the continuous
case. Two extra hidden units were added to the two KBCNNs with full connections
to the units of adjacent layers, i.e., the input and output layers of the KBCNN, and the
fuzzification and output layers of the fuzzy KBCNN. The sigmoid activation functions
were used in the KBCNNs.

Different neural network models are compared in Table 6.7 for the primary staging
of sleep using all-night records as the training set. MLP1 and MLP2 correspond to
multilayer perceptrons with one and two hidden layers, respectively. The
leave-one-out experiment was repeated five times for each network. The average
percentages on the training set and the test set are reported. In addition, the number
of weights and biases, i.e., free parameters, and the numbers of hidden units and
fuzziness units of the nets are compared. The fuzziness units in the fuzzification layer
are also hidden units. But because of their specially designed connections to the input
units, they are counted separately from other hidden units here.
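
As a check on the free-parameter counts, the number of weights and biases of a fully connected feed-forward network follows directly from its layer sizes. The sketch below reproduces the MLP entries of Table 6.7 under the assumption of 24 inputs in the discrete case, 6 inputs in the continuous case, and one output unit per sleep stage; the function name is hypothetical, and the KBCNN and fuzzified networks are not covered because their connections are not fully specified here.

```python
def free_parameters(layer_sizes):
    """Count weights plus biases of a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

mlp1_discrete = free_parameters([24, 20, 6])        # 626, as in Table 6.7
mlp1_continuous = free_parameters([6, 25, 6])       # 331, as in Table 6.7
mlp2_continuous = free_parameters([6, 16, 12, 6])   # 394, as in Table 6.7
```

The agreement with the tabulated values suggests that the MLPs are fully connected between adjacent layers with one bias per non-input unit.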

The KBCNN and the MLP performed at about the same level for the discrete case
(79.34% and 80.14% on the test set). In the case of continuous input, the test
results of the MLPs, with either one or two hidden layers (78.36% and 76.02%),
were a bit worse than those in the discrete case. On the other hand, the KBCNN and
the MLP with the fuzzification layer (the FKBCNN and the FMLP) performed better,
on both the training set and the test set, than all the above-mentioned neural
networks.

Fuzzy membership functions inherent in the fuzzification layer before and after
training are shown in Figure 6.1. The presented examples are the membership
functions for the alpha tokens and the beta tokens. The boundaries of the membership
functions before training are the same as those of the discretization levels. The
membership functions are tuned through adaptation of the center and the gain of the
sigmoid functions, which constitute the membership functions, in the fuzzification
layer.
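
As a rough illustration of how sigmoid units with an adaptable center and gain can
form a membership function, the sketch below builds a band-shaped membership as the
difference of two sigmoids. The difference-of-sigmoids construction, the function
names, and the numeric values are assumptions made for illustration only; they are
not taken from the fuzzification layer as defined elsewhere in this work.

```python
import math

def sigmoid(x, center, gain):
    """Logistic unit with an adaptable center (midpoint) and gain (slope)."""
    return 1.0 / (1.0 + math.exp(-gain * (x - center)))

def band_membership(x, low, high, gain):
    """Membership of x in the interval [low, high] with soft boundaries.

    Assumed construction: a rising sigmoid centered at `low` minus a rising
    sigmoid centered at `high`. A large gain approaches a crisp interval.
    """
    return sigmoid(x, low, gain) - sigmoid(x, high, gain)

# Hypothetical level covering token values from 10 to 30.
crisp_like = band_membership(20.0, 10.0, 30.0, gain=5.0)  # deep inside the band
boundary = band_membership(10.0, 10.0, 30.0, gain=5.0)    # exactly on the lower edge
soft = band_membership(12.0, 10.0, 30.0, gain=0.2)        # low gain, graded value
```

With a high gain the band behaves almost like the original discretization level,
which matches the statement that the boundaries before training coincide with the
discretization levels; lowering the gain or shifting the centers softens or moves
the boundaries.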

The  fuzzification  layer  provides  a better  representation  of  the  token  data.  In  the 
continuous  case,  the  representation  of  the  input  data  is  not  considered.  Raw  data 
are  fed  into  the  neural  networks  directly,  which  causes  difficulty  in  training.  When 
the  token  data  are  discretized,  some  information  in  the  token  data  is  lost.  Also,  the 
number  of  inconsistent  or  ambiguous  input-output  pairs  increases  significantly. 

To measure the “inconsistency” among the data, an input-output pair is defined
as being “inconsistent” if any other pair has the same input pattern but a different
desired output. The “maximum inconsistency” ($I_{\max}$) is then defined by

\[ I_{\max} = \frac{N_i}{N} \times 100\,(\%) \tag{6.1} \]

where $N_i$ is the total number of inconsistent instances and $N$ is the total number
of instances. This is the case when all inconsistent instances are misclassified.

Assume that all inconsistent instances of the same input pattern are classified
into the class sharing the maximal portion of these instances. This is the case when
the error caused by inconsistency is minimized. So, the “minimum inconsistency”
($I_{\min}$) is defined by

\[ I_{\min} = \frac{N_i - \sum_m \max_l (P_{l,m})}{N} \times 100\,(\%) \tag{6.2} \]

Here, $m$ indexes inconsistent input patterns, $l$ indicates sleep stages 0, 1, ..., 5,
and $P_{l,m}$ is the number of instances belonging to Stage $l$ in the data cluster of
inconsistent input pattern $m$. For example, assume 20 input-output pairs are of the
same inconsistent input pattern, and 4, 9, 0, 0, 0, and 7 of them belong to Stages 0,
1, 2, 3, 4, and 5, respectively. In the optimal case, this pattern is classified into
Stage 1. So, the maximum of $P_{l,m}$ here is 9, which should be subtracted from the
number of inconsistent instances to calculate the minimum inconsistency.
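
The two measures can be computed directly from a list of input-output pairs. The
sketch below is a minimal illustration, assuming each instance is given as a hashable
input pattern paired with its stage; the function name and the extra consistent
instances in the example are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency(pairs):
    """Return (I_max, I_min) in percent for a list of (input_pattern, stage) pairs."""
    by_pattern = defaultdict(Counter)
    for pattern, stage in pairs:
        by_pattern[pattern][stage] += 1

    total = len(pairs)
    n_inconsistent = 0     # N_i: instances whose pattern maps to more than one stage
    best_case_correct = 0  # sum over patterns m of the largest stage count P_{l,m}
    for stage_counts in by_pattern.values():
        if len(stage_counts) > 1:
            n_inconsistent += sum(stage_counts.values())
            best_case_correct += max(stage_counts.values())

    i_max = 100.0 * n_inconsistent / total
    i_min = 100.0 * (n_inconsistent - best_case_correct) / total
    return i_max, i_min

# Worked example from the text: one inconsistent pattern with 20 instances, of
# which 4, 9, and 7 belong to Stages 0, 1, and 5. The 30 consistent instances
# of a second pattern are added purely for illustration.
pairs = [("A", 0)] * 4 + [("A", 1)] * 9 + [("A", 5)] * 7 + [("B", 2)] * 30
i_max, i_min = inconsistency(pairs)
```

In this example 20 of the 50 instances are inconsistent, and at best 9 of them can
be classified correctly, so 11 instances remain unavoidable errors.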

The values of the maximum and minimum inconsistency of the training sets in the
leave-one-out experiments of the five selected records are calculated and averaged.
Both the original token data and the discretized token data are analyzed and shown
in Table 6.8. Since the minimum inconsistency of the discretized input case is 13.17%,
the best performance that a training set can achieve is only 86.83%. This explains the
low classification accuracy on the training sets in the discrete case. In the continuous
case, the patterns are much more complicated and diverse. So it is also difficult to
reach high accuracy for the training sets.

Table 6.8. Inconsistency of sleep tokens

Tokens         Maximum     Minimum
Continuous     12.12%      3.44%
Discretized    75.36%      13.17%

In  sum,  all-night  token  data  are  used  as  the  training  data  of  the  neural  network  in 
this  experiment.  The  experimental  results  show  that  the  addition  of  the  fuzzification 
layer  is  able  to  improve  the  performance  of  the  KBCNN  and  the  MLP. 

6.5  Experiment  3:  Context  Analysis  of  Sleep 

In  this  experiment,  the  extended  gamma  model  presented  in  Chapter  4 was  used 
to  learn  the  context  of  sleep.  The  best  results  from  previous  experiments  on  primary 
staging  were  further  processed  to  incorporate  context  information.  The  primary 
results  of  the  MLP  in  Experiment  1 (Table  6.5)  and  the  FKBCNN  in  Experiment  2 
(Table  6.7)  were  chosen  for  testing  the  context-learning  capability  of  the  extended 
gamma  model. 

Table 6.9. Comparison of primary and final sleep staging results

            Primary Staging       Final Staging
Record      MLP       FKBCNN      MLP           MLP               FKBCNN
                                  (6 stages)    (4C + δ + REM)    (6 stages)
10109       64.96     63.68       86.32         86.11             87.61
11623       78.09     80.10       91.44         90.43             89.92
11717       67.22     67.93       84.32         84.32             85.04
10067       64.81     63.92       73.50         72.61             71.71
11740       73.56     74.25       83.45         87.36             82.99
Average     69.73     69.98       83.81         84.17             83.45

The output of the MLP is the four sleep classes, which can be further distinguished
into six stages by thresholding the REM and the delta wave. Stage 1/5 is classified
into Stage 5 when the REM is present, and into Stage 1 otherwise. Stage 3/4 is
classified into Stage 3 when the delta activity is less than 30 seconds, and into
Stage 4 otherwise. These thresholding results can be used directly as the input to the
focused gamma net, as can the output of the FKBCNN. Both are primary stages which do
not include any context information.
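
The thresholding step described above can be sketched as a small function. The
class labels ("0", "1/5", "2", "3/4"), the argument names, and the representation
of REM as a boolean are assumptions made for this sketch; only the two decision
rules come from the text.

```python
def refine_to_six_stages(sleep_class, rem_present, delta_seconds):
    """Split the merged classes 1/5 and 3/4 into individual stages.

    sleep_class: one of "0", "1/5", "2", "3/4" (labels assumed for this sketch).
    rem_present: True when REM is detected in the epoch.
    delta_seconds: duration of delta activity in the epoch, in seconds.
    """
    if sleep_class == "1/5":
        return 5 if rem_present else 1
    if sleep_class == "3/4":
        return 3 if delta_seconds < 30 else 4
    return int(sleep_class)  # classes "0" and "2" map directly to Stages 0 and 2

primary_stages = [
    refine_to_six_stages("1/5", True, 0),
    refine_to_six_stages("1/5", False, 0),
    refine_to_six_stages("3/4", False, 12),
    refine_to_six_stages("3/4", False, 45),
    refine_to_six_stages("2", False, 5),
]
```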

The four sleep classes along with the delta activity and the REM can also be used
as the input to the focused gamma net. This is even closer to what human experts
seem to do. Both the delta and the REM were normalized into the range [0, 1],
since the four sleep classes were coded as 0 and 1. The normalization can result in
faster convergence in the training procedure.

The results of the primary staging from previous experiments and the results of
the final staging by the extended gamma model are presented in Table 6.9. Only the
classification percentages of the test records are shown here. The primary results of
the MLP were the six stages from thresholding the delta and the REM on the four-class
results. The primary results of the FKBCNN were obtained from comparing
the output with the human scoring (the “target” desired output) rather than the
modified desired output. Recall that the desired output of the FKBCNN for training
was modified from the human scoring to deliberately ignore the context needed for
the decision of the REM stages.

The average percentage of the primary staging, which was about 70%, was improved
to about 84% in the final staging by the extended gamma model. All three cases,
in terms of the input to the focused gamma net (the primary stages from the MLP,
the four sleep classes from the MLP plus the delta and the REM, and the primary
stages from the FKBCNN), performed at the same level. The 84% performance is quite
good considering that the leave-one-out method is employed.

The average percentage of the final staging on the training records was over 95%.
This shows that the extended gamma model can learn the sleep context very well.
The extended gamma model can even correct some errors produced in the primary
staging. This ability raises the classification accuracy on the training sets. But since
this error correction is not really the context of sleep, it may cause erroneous responses
on the test sets and lower the classification accuracy there. Therefore, the extended
gamma model should only be used to compensate for the lack of context in the
primary staging. If the primary staging percentage is too low, the extended gamma
model will not work well.

The memory parameters ($\mu_i$'s and $\nu_i$'s) are adaptive through the training
procedure. The initial values of all memory parameters were set to 0.9. The average
final (adapted) values of the memory parameter of each memory bank from the case of
the MLP with six primary stages are listed in Table 6.10 along with their corresponding
memory depths. In this case, each bank is associated with a given sleep stage.

Table 6.10. Average values of memory parameters and their corresponding memory depths

Stages                         0       1       2       3       4       REM

Causal Part ($\mu_i$'s)
Memory Parameters              0.65    0.44    0.97    0.95    0.89    0.52
Memory Depths (minutes)        7.69    11.36   5.15    5.26    5.62    9.61

Anticausal Part ($\nu_i$'s)
Memory Parameters              0.82    0.54    0.93    0.86    0.97    0.95
Memory Depths (minutes)        6.10    9.26    5.38    5.81    5.15    5.26

The values of the $\mu_i$'s and $\nu_i$'s adapted to the values that minimize the
output mean square error. Each sleep stage requires a different context, ranging from
10 to 20 minutes. As expected, Stages 1 and REM need more context information for the
final decision than the other stages. Notice that the memory orders of both the causal
part and the anticausal part were 5. So the value of required context for each stage in
this experiment was greater than 10 minutes.
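
The tabulated depths appear consistent with the usual gamma-memory relation, depth
equal to memory order divided by memory parameter, with an order of 5 and one-minute
epochs. That relation is not restated in this section, so the sketch below treats it
as an assumption and simply checks it against the values of Table 6.10; the variable
and function names are illustrative.

```python
MEMORY_ORDER = 5  # order of both the causal and anticausal memories (from the text)

def memory_depth(parameter, order=MEMORY_ORDER):
    """Mean memory depth in epochs, assuming depth = order / parameter."""
    return order / parameter

stages = ["0", "1", "2", "3", "4", "REM"]
causal = [0.65, 0.44, 0.97, 0.95, 0.89, 0.52]      # mu values from Table 6.10
anticausal = [0.82, 0.54, 0.93, 0.86, 0.97, 0.95]  # nu values from Table 6.10

# Total context per stage: causal depth plus anticausal depth, in minutes.
total_context = {
    stage: memory_depth(mu) + memory_depth(nu)
    for stage, mu, nu in zip(stages, causal, anticausal)
}
```

Because a memory parameter cannot exceed 1, each part contributes at least 5 minutes
under this relation, which is why the required context per stage exceeds 10 minutes.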

When the best performance of each record is picked from the three cases of the
final staging in Table 6.9, the average percentage is 85%. This result is at the same
level as the ones reported in Principe et al. [1993]. However, the neural networks are
examined by cross-validation in this work, so the results are more solid than the
previous ones.

6.6 Experiment 4: Test on Other Records

In  this  experiment,  the  five  selected  records  were  used  as  the  training  set,  and 
the  remainder  was  used  as  the  test  set.  Recall  that  the  five  selected  records  cover 
different  age  groups.  Using  them  as  the  training  set  makes  the  neural  networks  learn 
different  sleep  patterns  although  not  all  possible  ones. 

Two methods were tried: (1) the KBCNN with a fuzzification layer trained on
all-night data plus the extended gamma model trained on the six primary stages; and
(2) the MLP trained on the 24 artificial patterns plus the extended gamma model
trained on the four sleep classes, normalized delta wave time, and normalized REM
count. The desired output of the FKBCNN for training was the modified human sleep
scoring. The trained FKBCNN was then tested on the records with both the modified
and unmodified desired outputs. The experimental results are shown in Table 6.11.

The performance of the final staging in the all-night data case is at least as good
as that of the artificial-pattern case for every training and test record. For the
all-night data case, the average percentage was about 95% on the training records and
82.2% on the test records. The results on the test records are comparable to the ones
in Chang [1987], which averaged 83.7%. The results of four test records were better
than Chang’s; one was slightly worse; and two were much worse. Two of the records,
11747 and 10889, had much poorer performance than the others and degraded the average
percentage dramatically. These two records appear to have token patterns and sleep
stage profiles very different from those of the others. Their percentages were already
much worse than the others in the primary staging, which is why the extended gamma
model cannot produce a high percentage on the final staging.

Table 6.11. Results of training on selected records and testing on the rest

                    All-Night Data (FKBCNN)              Artificial Patterns (MLP)
         Record     Mod. O/P    Unmod. O/P    Final      4 Classes    Final

TRAIN    10109      88.7        68.6          95.1       87.0         94.0
         11623      92.4        80.9          95.2       94.5         91.2
         11717      88.4        71.0          94.1       89.3         90.0
         10067      80.6        69.5          95.3       76.4         93.8
         11740      85.1        75.9          94.3       82.1         94.3
         Average    87.0        73.2          94.8       85.9         92.7

TEST     10158      87.0        80.9          89.7       83.4         79.8
         10146      80.6        71.7          85.8       91.6         83.6
         11769      89.8        67.7          91.3       89.0         87.9
         11747*     80.1        62.1          74.3       71.6         62.1
         10244      84.7        67.0          83.9       84.2         82.1
         11771      82.3        81.6          83.2       86.1         80.5
         10889*     66.4        63.7          67.1       62.1         55.4
         Average    81.6        70.7          82.2       81.1         75.9

To further investigate the poor performance on Records 11747 and 10889, the
“similarity” between the test records and the five training records was examined. The
similarity of each test record to the five training records is defined as the percentage
of its input patterns, and of its input-output pairs, that can also be found in the five
training records. The results are shown in Table 6.12. As expected, Records 11747 and
10889 have the lowest percentages among the test records. This gives a rough idea of
how different the two records are from the others.
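
The similarity measure can be sketched as two set-membership counts. The sketch
assumes each instance is a hashable input pattern paired with its stage and that the
percentages are counted per test instance rather than per distinct pattern; both
choices, and all names, are assumptions of this illustration.

```python
def similarity(test_pairs, training_pairs):
    """Return (input-pattern %, I/O-pair %) of test instances also seen in training.

    Each pair is (input_pattern, stage). Counting per instance, rather than per
    distinct pattern, is an assumption of this sketch.
    """
    seen_inputs = {pattern for pattern, _ in training_pairs}
    seen_pairs = set(training_pairs)
    n = len(test_pairs)
    input_hits = sum(1 for pattern, _ in test_pairs if pattern in seen_inputs)
    pair_hits = sum(1 for pair in test_pairs if pair in seen_pairs)
    return 100.0 * input_hits / n, 100.0 * pair_hits / n

# Hypothetical token patterns, encoded as tuples of discretized levels.
training = [((1, 0, 2), 0), ((1, 1, 2), 1)]
test = [((1, 0, 2), 0), ((1, 1, 2), 2), ((3, 3, 3), 4), ((0, 0, 0), 0)]
input_pct, pair_pct = similarity(test, training)
```

A matching input-output pair always implies a matching input pattern, so the
I/O-pair percentage can never exceed the input-pattern percentage.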

Table 6.12. Similarity of the test records to the five selected training records

Record     Input patterns (%)    I/O pairs (%)
10158      30.73                 28.44
10146      42.71                 36.93
11769      43.74                 40.34
11747*     13.77                 9.03
10244      38.92                 33.49
11771      22.42                 19.28
10889*     1.38                  1.38

The set of artificial patterns was quite efficient in training the neural network
to classify the four sleep classes. But it is impossible to cover all possible token
patterns with this limited number of artificial patterns. Three of the records, 10109,
11747 and 10889, had very low percentages in the classification of the four sleep
classes. When the extended gamma model was used for context processing, one of
them was in the training set, and the other two were in the test set. This lowered
the overall performance. The same problem existed when all-night data were used for
the primary staging. The three records also had poorer performance in the all-night
data case. But by using all-night data as the training data, the limitation of artificial
patterns was somewhat relieved.

6.7  Experiment  5:  Record  Grouping  by  Age 

It  was  mentioned  in  Chapter  5 that  the  intra-  and  inter-subject  variability  is  the 
major  challenge  for  the  sleep  staging  problem.  To  narrow  down  the  scope,  we  may 
divide  the  records  into  groups  by  the  ages  of  the  subjects.  It  is  expected  that  this 
would  reduce  the  intersubject  variability  of  sleep  patterns  within  each  group. 

The 12 records were divided into three age groups as shown in Table 6.1. Within
each group, the leave-one-out experiments were performed. The overall performance
was not as good as expected. The average percentage on the final staging was 85.8%
for Group I, 81.9% for Group II, and 70.0% for Group III. These results are poorer
than the ones from previous experiments. There are two explanations for the poor
performance. First, there were only three records in the training set compared to
four or five records in previous experiments. Although the records were in the same
age group, a sufficient number of records is still needed in the training set in order
to cover more sleep patterns. Secondly, Records 11747 and 10889 still caused trouble
in Groups II and III, respectively. The classification percentage of the other records
deteriorated when either one of them was in the training set. However, the percentage
of the two records themselves improved, because the training records came from their
own age groups and had sleep patterns more similar to theirs than the records of
other groups.

In sum, dividing records into different age groups is a way to reduce the
inter-subject variability. Nevertheless, enough records are needed in the training set to
cover more sleep patterns. Grouping records according to their similarity is certainly
a means to improve the performance. Other criteria besides age may be found and
used for such a purpose.

6.8  Discussion 

The  24  artificial  training  patterns  based  on  human  knowledge  are  quite  efficient 
in  training  the  neural  network  to  classify  the  four  sleep  classes.  But  because  the 
electrophysiological  signal  is  unconstrained,  it  is  impossible  for  the  artificial  set  to 
cover  all  possible  patterns.  In  addition  to  human  knowledge,  statistical  methods  can 
also  be  used  to  analyze  sleep  records  in  order  to  compose  a better  artificial  training 
set.  Until  better  artificial  training  patterns  can  be  found,  carefully  selected  sleep 
records  should  be  used  as  the  training  set  for  the  primary  staging. 

There are some problems with all-night data as the training set for the primary
staging. First, the distribution of the six stages is not uniform. This can be seen
from Table 6.1. The neural networks have difficulty in learning stages occupying only
a small portion of sleep, e.g., Stage 3. To improve this situation, statistical analysis
is necessary to come up with an abstract set of training data from the all-night
data. Within the abstract training set, sleep stages should be uniformly distributed.
Secondly, a sleep stage usually lasts for a number of epochs. Training on a sequence
of epochs of the same stage might cause “overtraining” on this particular stage before
the net receives data of other stages. One way to relieve this problem is to reduce
the learning rate. The abstract training set can also help with this problem.
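
One simple stand-in for such an abstract training set is random undersampling to a
uniform stage distribution, followed by shuffling so that long runs of a single stage
are broken up. This is only an illustrative sketch: it is not the statistical analysis
called for above, and the function name, the seed, and the data layout are assumptions.

```python
import random
from collections import Counter, defaultdict

def balanced_shuffled_subset(pairs, seed=0):
    """Undersample each stage to the size of the rarest stage, then shuffle."""
    rng = random.Random(seed)
    by_stage = defaultdict(list)
    for pattern, stage in pairs:
        by_stage[stage].append((pattern, stage))
    per_stage = min(len(items) for items in by_stage.values())
    subset = []
    for items in by_stage.values():
        subset.extend(rng.sample(items, per_stage))
    rng.shuffle(subset)  # break up long runs of the same stage
    return subset

# Hypothetical all-night data in which Stage 3 is rare.
all_night = ([(("p", i), 2) for i in range(50)]
             + [(("q", i), 3) for i in range(5)]
             + [(("r", i), 0) for i in range(20)])
abstract_set = balanced_shuffled_subset(all_night)
stage_counts = Counter(stage for _, stage in abstract_set)
```

Undersampling discards data from the frequent stages, so in practice it trades
coverage of sleep patterns for a uniform stage distribution.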

One major issue of neural network training is when to stop the training procedure
in order to achieve the maximum generalization capability. In the experiments, the
mean square error (MSE) was used as the criterion for stopping the training procedure.
When the MSE was below a preset value, or when the MSE fluctuation was very small,
the training procedure stopped. The criteria are heuristic and do not guarantee optimal
generalization capability. However, they set a standard for stopping the training
procedure in all the experiments. The results reported in the experiments are not
the “best” we can get. The best performance can only be obtained when the test
set percentage is monitored. But it is impractical to do so. Better stopping criteria
for training, with a sound theoretical background, would enhance the power of
the neural models.
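
The two heuristic stopping conditions can be sketched as follows. The threshold
values, the window length, and the use of the max-minus-min range as the measure of
fluctuation are placeholders chosen for illustration, not the values or definitions
used in the experiments.

```python
def should_stop(mse_history, mse_target=0.01, window=5, min_fluctuation=1e-4):
    """Return True when the latest MSE is below target or has stopped changing.

    mse_history: MSE values recorded after each training pass, oldest first.
    All threshold values are placeholders, not values from the experiments.
    """
    if not mse_history:
        return False
    if mse_history[-1] < mse_target:
        return True
    if len(mse_history) >= window:
        recent = mse_history[-window:]
        return max(recent) - min(recent) < min_fluctuation
    return False
```

Both conditions look only at the training error, which is why they cannot guarantee
the best generalization on the test set.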

The minute tokens from a waveform detection system are used as the input to
the neural networks. It is assumed that the token data contain all the information
needed for sleep staging. The performance of the neural networks depends fully on
the information in the tokens. The fuzzification layer provides a better representation
of the token information. Thus, the overall performance of the KBCNN and the MLP
was improved by the added fuzzification layer.

Figure 6.1. Fuzzy membership functions before and after training: (a) alpha tokens,
before training; (b) alpha tokens, after training; (c) beta tokens, before training;
(d) beta tokens, after training


CHAPTER  7 

SUMMARY, DISCUSSIONS AND CONCLUSIONS

7.1 Summary

The neural network acquires its knowledge by learning from the data, whereas the
expert system has knowledge formulated by human experts. To combine these two
sources of knowledge, the knowledge-based neural network was developed to deal
with the transformation of domain knowledge into a neural network structure. The
rule-based neural network is a kind of knowledge-based neural network, which has
knowledge in the format of production rules. Yet previous research on rule-based
neural networks did not consider two important aspects of human knowledge, i.e.,
fuzziness and context. This dissertation is intended to enhance rule-based neural
networks by adding neural models that handle these two aspects of human knowledge.

A simple fuzzification layer was developed to fuzzify continuous inputs which cannot
be easily discretized. The fuzzy membership functions are composed of the sigmoid
activation functions of the neurons. The membership functions can be refined
by training connection weights. Furthermore, the fuzzification layer can be
superimposed onto a rule-based neural network as well as any back-propagation network.
The standard back-propagation training procedure can be applied here with only a
minor modification. Dividing the continuous input into several levels makes neural
networks learn more efficiently. But discretization causes a great loss of information
in some cases. Thus, the adaptive fuzzy membership functions implemented by the
fuzzification layer provide a better representation of the input.

The gamma model, originally developed for temporal processing, was extended
by adding an anticausal memory structure in order to deal with the context learning
problem. The gamma model has a more flexible memory structure than the other
memory models. The memory depth of the gamma memory can stretch outside the
window defined by the order of the memory structure. With the added anticausal
memory, the extended gamma model can learn not only the left context but also the
right context. This context-processing neural model was used to further process the
output of a rule-based neural network or a general feedforward network to compensate
for the lack of context-processing capability of these networks. An on-line learning and
processing scheme was developed for practical use.
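
For orientation, the sketch below shows a causal gamma memory using the standard
gamma recursion, and an anticausal counterpart obtained by running the same recursion
backwards in time. The recursion is stated here as an assumption (it is not restated
in this chapter), and the time-reversal construction is only one plausible reading of
the anticausal extension; all names are illustrative.

```python
def gamma_memory(signal, mu, order):
    """Causal gamma memory: returns taps[k][t] for k = 0..order.

    Assumed recursion (standard gamma memory):
        x_0(t) = u(t)
        x_k(t) = (1 - mu) * x_k(t - 1) + mu * x_{k-1}(t - 1),  k >= 1
    with all taps initialised to zero before the first sample.
    """
    taps = [[0.0] * len(signal) for _ in range(order + 1)]
    taps[0] = [float(u) for u in signal]
    for t in range(1, len(signal)):
        for k in range(1, order + 1):
            taps[k][t] = (1 - mu) * taps[k][t - 1] + mu * taps[k - 1][t - 1]
    return taps

def anticausal_gamma_memory(signal, nu, order):
    """Anticausal counterpart, sketched as the same recursion run backwards in time."""
    reversed_taps = gamma_memory(signal[::-1], nu, order)
    return [tap[::-1] for tap in reversed_taps]

# With a memory parameter of 1 the memory reduces to a tapped delay line,
# so tap k holds the input from k steps in the past (or in the future).
left = gamma_memory([1, 2, 3, 4, 5], mu=1.0, order=2)
right = anticausal_gamma_memory([1, 2, 3, 4, 5], nu=1.0, order=2)
```

For memory parameters below 1, each tap becomes a decaying average that reaches
further back (or forward) than the number of taps, which is the sense in which the
memory depth can stretch outside the window defined by the order.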

The sleep staging problem is the classification of sleep into six stages according
to the occurrence or the duration of relevant EEG/EOG/EMG events. Sleep staging
involves incomplete human knowledge and a massive amount of data, which makes it an
application domain suitable for the knowledge-based neural network. Several
experiments were designed to evaluate the performance of the rule-based neural network,
the effect of the fuzzification layer on the rule-based neural network and the
multilayer perceptron, and the context-learning capability of the extended gamma model.
A neural network-based hybrid system composed of the above-mentioned neural models
was also tested on the sleep staging problem. Sleep records with subject
ages ranging from 13 to 70 were used in the experiments. The results showed that
the fuzzification layer and the extended gamma model do help improve the overall
performance, and that the hybrid system is very promising for the sleep staging problem.

7.2  Discussions  and  Conclusions 

The rule-based neural network (RBNN) was further examined by comparing it with
the multilayer perceptron (MLP). The experimental results showed that the
performance of the RBNN is equal to or better than that of the MLP. Moreover, the RBNN
has far fewer connections than the MLP, so the learning and processing procedures
are more efficient with the RBNN. Also, the RBNN has the advantage over the
MLP when the training data are insufficient, because the RBNN has the transformed
knowledge from human-formulated rules.

The fuzzification layer was designed to be superimposed on any back-propagation
neural network. It is an effective way to handle the fuzzification of the input data.
The experimental results showed that the fuzzification layer improves the performance
of the KBCNN and the MLP. In addition, the fuzzification layer fits the framework
of the RBNN because of the similarity between the two. Both the fuzzification layer
and the RBNN are constructed from specially designed subnetworks, i.e., neurons and
interconnections. Both also use human knowledge to decide the initial values
of the connection weights.

The adaptive fuzzy membership functions in the fuzzification layer can be recovered
from the values of connection weights and biases of the fuzzification neurons.
The fuzzy membership functions are tuned through adaptation of the weights and
biases. To keep the membership functions meaningful under the definitions of
fuzzy sets, the adaptation of weights and biases must be specially constrained. However,
a better representation of the input could be obtained through free adaptation of the
weights and biases, although it might not follow fuzzy set theory.
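
Recovering a membership function from a trained fuzzification neuron amounts to
reading off the center and gain of its sigmoid. The sketch below assumes the neuron
computes sigmoid(weight * x + bias); that parameterization and the function name are
assumptions of this illustration.

```python
def center_and_gain(weight, bias):
    """Recover the sigmoid's center and gain from a neuron's weight and bias.

    Assumes the neuron computes sigmoid(weight * x + bias), which equals
    sigmoid(gain * (x - center)) with gain = weight and center = -bias / weight.
    """
    if weight == 0:
        raise ValueError("a zero weight gives a flat response with no center")
    return -bias / weight, weight

# Hypothetical trained values for one fuzzification neuron.
center, gain = center_and_gain(weight=0.5, bias=-10.0)
```

Under free adaptation the recovered centers may cross or the gains may change sign,
which is one way the tuned functions can stop being meaningful as fuzzy sets.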

The gamma model was extended to suit the context learning problem. The
experimental results showed that the extended gamma model can convert low-percentage
primary results into high-percentage final results. It not only learns the context
information but also corrects errors of the primary results. Furthermore, the gamma
model is used here for symbolic processing, which has not been tried before.

There are two major problems in the information processing of sleep tokens. First,
the quality of the tokens (the input to the system) is not assured. Although the
waveform detection system can have over 90% accuracy, the remaining error of several
percent still introduces false information into the tokens. Secondly, the human sleep
scoring (the desired output of the system) is not always reliable because of the
imprecision of humans in scoring the sleep stages. These two problems prevent all
information processing systems from reaching the desired high-percentage performance
on sleep staging.

To enhance the overall performance of information processing systems on the
sleep staging problem, the quality of the input and the desired output needs to be
improved. Neural networks could be applied to the task of waveform detection
for EEG/EOG/EMG signals. Only when the waveform information in the tokens has
near 100% accuracy can the sleep staging accuracy be over 90%. The human sleep
scoring also needs extensive checking by the experts before being used as the desired
output.

The  hybrid  information  processing  system  includes  the  rule-based  neural  network 
with  the  fuzzification  layer  and  the  extended  gamma  model.  This  system  utilizes 
both  the  knowledge  of  a certain  problem  domain  and  the  knowledge  inherent  in  the 
data.  Also,  it  takes  the  fuzzy  nature  of  human  knowledge  and  the  context  information 
in  the  data  into  consideration.  The  sleep  staging  problem  is  well  covered  under  this 
framework.  The  hybrid  system  can  also  be  applied  to  other  information  processing 
problems  with  similar  characteristics. 

SLEEP STAGE PROFILES

Figure 0.2. Record 10158
Figure 0.3. Record 10146
Figure 0.4. Record 11623
Figure 0.5. Record 11769
Figure 0.6. Record 11717
Figure 0.7. Record 11747
Figure 0.8. Record 10244
Figure 0.9. Record 10067
Figure 0.10. Record 11771
Figure 0.11. Record 10889
Figure 0.12. Record 11740